Fusion of visual representations for multimodal information extraction from unstructured transactional documents

Berke Oral, Gülşen Eryiğit*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

The importance of automated document understanding for today's businesses' speed, efficiency, and cost reduction is indisputable. Although structured and semi-structured business documents have been studied intensively in the literature, information extraction from unstructured ones remains an open and challenging research topic due to their difficulty and the scarcity of available datasets. Transactional documents occupy a special place among the various types of business documents, as they serve to track the financial flow, and are accordingly the most studied type. Processing unstructured transactional documents requires the extraction of complex relations (i.e., n-ary, document-level, overlapping, and nested relations). Studies focusing on unstructured transactional documents rely mostly on textual information; however, the impact of their visual composition remains an unexplored area and may be valuable for their automatic understanding. For the first time in the literature, this article investigates the impact of using different visual representations and their fusion on information extraction from unstructured transactional documents (i.e., complex relation extraction from money transfer order documents). It introduces and experiments with five visual representation approaches (i.e., word bounding box, grid embedding, grid convolutional neural network, layout embedding, and layout graph convolutional neural network) and their possible fusion with five strategies (i.e., three basic vector operations, weighted fusion, and attention-based fusion). The results show that the fusion strategies provide a valuable enhancement by combining diverse visual information, from which unstructured transactional document understanding obtains different benefits depending on the context. While individual visual representations have little effect when added to a purely textual baseline, their fusion provides a relative error reduction of up to 33%.
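To make the fusion strategies named in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of fusing two visual representation vectors (e.g., a grid-CNN feature and a layout-GCN feature for the same word). It assumes the three basic vector operations are addition, element-wise product, and concatenation; the class name, strategy labels, and dimensions are illustrative and do not reproduce the paper's exact formulations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLayer(nn.Module):
    """Fuses two equal-sized visual representation vectors.

    Illustrative sketch only: strategy names follow the abstract,
    but the paper's exact formulations are not reproduced here.
    """

    def __init__(self, dim, strategy="attention"):
        super().__init__()
        self.strategy = strategy
        if strategy == "weighted":
            # One learnable scalar weight per input representation.
            self.weights = nn.Parameter(torch.ones(2))
        elif strategy == "attention":
            # Scores each representation, then mixes by softmax weights.
            self.score = nn.Linear(dim, 1)

    def forward(self, a, b):  # a, b: (batch, dim)
        if self.strategy == "sum":        # basic vector operation 1
            return a + b
        if self.strategy == "product":    # basic vector operation 2
            return a * b
        if self.strategy == "concat":     # basic vector operation 3
            return torch.cat([a, b], dim=-1)
        if self.strategy == "weighted":
            w = torch.softmax(self.weights, dim=0)
            return w[0] * a + w[1] * b
        # Attention-based fusion: per-example weights over the two inputs.
        stacked = torch.stack([a, b], dim=1)          # (batch, 2, dim)
        attn = F.softmax(self.score(stacked), dim=1)  # (batch, 2, 1)
        return (attn * stacked).sum(dim=1)            # (batch, dim)

# Usage: fuse two 128-dimensional visual features for a batch of 4 words.
fusion = FusionLayer(dim=128, strategy="attention")
fused = fusion(torch.randn(4, 128), torch.randn(4, 128))
```

The design choice the abstract hints at is visible here: the basic operations are parameter-free, weighted fusion learns a global mixing ratio, and attention-based fusion computes input-dependent weights, which is what allows different representations to dominate in different contexts.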

Original language: English
Pages (from-to): 187-205
Number of pages: 19
Journal: International Journal on Document Analysis and Recognition
Volume: 25
Issue number: 3
DOIs
Publication status: Published - September 2022

Bibliographical note

Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature.

Funding

This work is funded by the Scientific and Technological Research Council of Turkey (TUBITAK) and by Yapı Kredi Technology under a TUBITAK 1505 (University-Industry Cooperation Support Program) project, Grant No. 5190073.

Funders and funder numbers:
TUBITAK
TUBITAK 1505 (Grant No. 5190073)
Yapı Kredi Technology
Türkiye Bilimsel ve Teknolojik Araştırma Kurumu

Keywords

• Complex relation extraction
• Document understanding
• Information extraction
• Information fusion
• Unstructured documents
• Visual representations
