
Hierarchical multimodal transformers for Multipage DocVQA

Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny

Research output: Contribution to journal, Article, Research, Peer-reviewed

Abstract

Existing work on DocVQA considers only single-page documents. However, in real applications documents are mostly composed of multiple pages that should be processed together. In this work, we propose a new multimodal hierarchical method, Hi-VT5, that overcomes the limitations of current methods in processing long multipage documents. In contrast to previous hierarchical methods, which focus on different semantic granularities (He et al., 2021) or on different subtasks (Zhou et al., 2022) in image classification, our method is a hierarchical transformer architecture in which the encoder learns to summarize the most relevant information of every page and the decoder then uses this summarized representation to generate the final answer, following a bottom-up approach. Moreover, due to the lack of multipage DocVQA datasets, we also introduce MP-DocVQA, an extension of SP-DocVQA in which questions are posed over multipage documents instead of single pages. Through extensive experimentation, we demonstrate that Hi-VT5 is able, in a single stage, to answer the questions and to identify the page that contains the answer, which can be used as a kind of explainability measure.
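
The bottom-up flow described in the abstract (summarize each page, then answer from the summaries and report a page) can be illustrated with a toy sketch. This is not the Hi-VT5 implementation: the function names (`attend`, `summarize_page`, `answer_over_document`), the use of plain dot-product attention as the "encoder" and "decoder", the `num_summary` parameter, and the attention-mass heuristic for choosing the answer page are all assumptions made here purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


def summarize_page(page_tokens, question, num_summary=2):
    """Hypothetical stand-in for the page encoder: compress one page into
    a few summary vectors, conditioned on the question."""
    summaries = []
    for slot in range(num_summary):
        # Assumption: each summary slot uses a slightly shifted query.
        query = [q + 0.1 * slot for q in question]
        pooled, _ = attend(query, page_tokens, page_tokens)
        summaries.append(pooled)
    return summaries


def answer_over_document(pages, question, num_summary=2):
    """Encode every page independently, concatenate the summaries, and let
    a stand-in decoder step attend over them (bottom-up)."""
    all_summaries, owner_page = [], []
    for page_idx, tokens in enumerate(pages):
        for vec in summarize_page(tokens, question, num_summary):
            all_summaries.append(vec)
            owner_page.append(page_idx)

    # Stand-in for the decoder: one attention step over all page summaries.
    context, weights = attend(question, all_summaries, all_summaries)

    # Illustrative heuristic only: pick the page with the most attention mass.
    mass = [0.0] * len(pages)
    for w, p in zip(weights, owner_page):
        mass[p] += w
    predicted_page = max(range(len(pages)), key=mass.__getitem__)
    return context, predicted_page


# Two toy pages of 4-dimensional token embeddings; only the second page
# contains tokens aligned with the question vector.
question = [1.0, 0.0, 0.0, 0.0]
pages = [
    [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[5.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]],
]
context, predicted_page = answer_over_document(pages, question)
print(predicted_page)
```

The point of the sketch is the cost structure: each page is processed on its own, so the final step only sees `num_summary` vectors per page rather than every token of a long document.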
Original language: English
Article number: 109834
Journal: Pattern Recognition
Volume: 144
Publication status: Published - Dec 2023
