TY - CHAP
T1 - InfographicVQA
AU - Mathew, Minesh
AU - Bagal, Viraj
AU - Tito, Ruben
AU - Karatzas, Dimosthenis
AU - Valveny, Ernest
AU - Jawahar, C. V.
N1 - Funding Information:
This work is supported by MeitY, Government of India, the CERCA Programme / Generalitat de Catalunya and project PID2020-116298GB-I0.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Infographics communicate information using a combination of textual, graphical and visual elements. This work explores the automatic understanding of infographic images using a Visual Question Answering (VQA) technique. To this end, we present InfographicVQA, a new dataset comprising a diverse collection of infographics and question-answer annotations. The questions require methods that jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with an emphasis on questions that require elementary reasoning and basic arithmetic skills. For VQA on the dataset, we evaluate two strong Transformer-based baselines. Both baselines yield unsatisfactory results compared to near-perfect human performance on the dataset. The results suggest that VQA on infographics, images designed to communicate information quickly and clearly to the human brain, is ideal for benchmarking machine understanding of complex document images. The dataset is available for download at docvqa.org.
KW - Document Analysis Datasets
KW - Evaluation and Comparison of Vision Algorithms
KW - Vision and Languages
UR - http://www.scopus.com/inward/record.url?scp=85126124388&partnerID=8YFLogxK
U2 - 10.1109/WACV51458.2022.00264
DO - 10.1109/WACV51458.2022.00264
M3 - Chapter
AN - SCOPUS:85126124388
T3 - Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022
SP - 2582
EP - 2591
BT - Proceedings - 2022 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022
PB - Institute of Electrical and Electronics Engineers Inc.
ER -