Fusion of visual representations for multimodal information extraction from unstructured transactional documents

Berke Oral, Gülşen Eryiğit*

*Corresponding author for this work

Research output: Article (peer-reviewed)

6 Citations (Scopus)

Abstract

The importance of automated document understanding for today's businesses' speed, efficiency, and cost reduction is indisputable. Although structured and semi-structured business documents have been studied intensively in the literature, information extraction from unstructured ones remains an open and challenging research topic due to their difficulty and the scarcity of available datasets. Transactional documents occupy a special place among the various types of business documents, as they serve to track the financial flow, and are accordingly the most studied type. Processing unstructured transactional documents requires the extraction of complex relations (i.e., n-ary, document-level, overlapping, and nested relations). Studies focusing on unstructured transactional documents rely mostly on textual information. However, the impact of their visual composition remains an unexplored area and may be valuable for their automatic understanding. For the first time in the literature, this article investigates the impact of using different visual representations and their fusion on information extraction from unstructured transactional documents (i.e., for complex relation extraction from money transfer order documents). It introduces and experiments with five different visual representation approaches (i.e., word bounding box, grid embedding, grid convolutional neural network, layout embedding, and layout graph convolutional neural network) and their possible fusion with five different strategies (i.e., three basic vector operations, weighted fusion, and attention-based fusion). The results show that fusion strategies provide a valuable enhancement in combining diverse visual information, from which unstructured transactional document understanding obtains different benefits depending on the context.
While different visual representations have little effect when added individually to a pure textual baseline, their fusion provides a relative error reduction of up to 33%.
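To make the fusion strategies named in the abstract concrete, the sketch below illustrates one plausible reading of them: element-wise vector operations (sum, max), concatenation, weighted fusion, and attention-based fusion over per-word visual representation vectors. The function name, signature, and the choice of the mean vector as the attention query are illustrative assumptions, not the authors' actual architecture (in the paper, weights and attention parameters would be learned).

```python
import numpy as np

def fuse(reps, strategy="attention", weights=None):
    """Fuse a list of equal-length visual representation vectors.

    Illustrative sketch only: the strategy names mirror the abstract
    (basic vector operations, weighted fusion, attention-based fusion),
    but the concrete formulas are assumptions, not the paper's model.
    """
    R = np.stack(reps)                      # shape: (n_reps, dim)
    if strategy == "sum":                   # element-wise sum
        return R.sum(axis=0)
    if strategy == "max":                   # element-wise max
        return R.max(axis=0)
    if strategy == "concat":                # concatenation
        return R.reshape(-1)
    if strategy == "weighted":              # fixed mixing weights
        w = np.asarray(weights, dtype=float)
        return (w[:, None] * R).sum(axis=0)
    if strategy == "attention":
        # Score each representation against a shared query (here the
        # mean vector), softmax the scores, and mix accordingly.
        query = R.mean(axis=0)
        scores = R @ query
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        return (alpha[:, None] * R).sum(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")
```

In a trainable setting, the fixed `weights` and the attention query would be model parameters optimized jointly with the extraction objective.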

Original language: English
Pages (from-to): 187-205
Number of pages: 19
Journal: International Journal on Document Analysis and Recognition
Volume: 25
Issue number: 3
DOIs
Publication status: Published - Sep 2022

Bibliographic note

Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Funding

This work is funded by the Scientific and Technological Research Council of Turkey (TUBITAK) and by Yapı Kredi Technology with a TUBITAK 1505 (University - Industry Cooperation Support Program) project Grant No. 5190073.

Funders: Funder number
TUBITAK
TUBITAK 1505: 5190073
Yapı Kredi Technology
Türkiye Bilimsel ve Teknolojik Araştirma Kurumu
