Scene text visual question answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marcal Rusinol, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas

Research output: Chapter in book › Chapter › Research › Peer-reviewed

199 Citations (Scopus)

Abstract

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the Visual Question Answering process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks that accounts for both reasoning errors and shortcomings of the text recognition module. In addition, we put forward a series of baseline methods, which provide further insight into the newly released dataset and set the scene for further research.
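The evaluation metric the abstract refers to is based on normalized Levenshtein similarity, so that answers with minor OCR errors receive partial credit while unrelated answers score zero. The following is a minimal sketch of such a metric; the 0.5 cut-off threshold and the lowercasing/whitespace normalization are assumptions for illustration, not a verbatim transcription of the paper's protocol.

```python
# Sketch of a soft VQA evaluation score based on normalized Levenshtein
# similarity. The threshold tau=0.5 and the string normalization steps
# are illustrative assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cur[j] = min(prev[j] + 1,                 # deletion
                         cur[j - 1] + 1,              # insertion
                         prev[j - 1] + (ca != cb))    # substitution
        prev = cur
    return prev[-1]


def soft_score(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Score one predicted answer against a set of valid answers.

    Returns 1 - (normalized edit distance) for the closest ground truth,
    zeroed out when the normalized distance reaches tau, so reasoning
    failures score 0 while mild recognition errors are partially credited.
    """
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        denom = max(len(p), len(g))
        nl = levenshtein(p, g) / denom if denom else 0.0
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best
```

For example, `soft_score("cocacola", ["coca cola"])` credits the answer with roughly 0.89 (one missing space out of nine characters), whereas a semantically wrong answer such as `"pepsi"` scores 0.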

Original language: English
Title of host publication: Proceedings - 2019 International Conference on Computer Vision, ICCV 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4290-4300
Number of pages: 11
ISBN (electronic): 9781728148038
DOIs
Publication status: Published - Oct 2019

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
Volume: 2019-October
ISSN (print): 1550-5499
