Video Text Reading Competition for Dense and Small Text

Dataset

Description

Video text spotting [1] has received increasing attention due to its numerous applications in computer vision, e.g., video understanding, video retrieval, video text translation, and license plate recognition. However, progress in video text spotting has largely stalled due to the lack of practical datasets and effective methods.

Several video text spotting benchmarks already exist, e.g., ICDAR2015 (Text in Videos) [2], YouTube Video Text (YVT) [3], RoadText-1K [4], and BOVText [5]. These benchmarks focus on common text cases (e.g., normal size and density) and single scenarios, while ignoring extreme video text challenges, i.e., dense and small text in various scenarios.

In this competition, we establish a video text reading benchmark, named DSText, which focuses on dense and small text reading in videos across diverse scenarios. Compared with previous datasets, the proposed dataset mainly introduces three new challenges: 1) dense video text, a new challenge for video text spotters; 2) a high proportion of small text; and 3) various new scenarios, e.g., ‘Game’, ‘Sports’, etc.

Besides, similar to the ICDAR2015 video text spotting challenge, DSText also presents common technological challenges for video text. For example, image quality is generally worse than in static images due to motion blur and out-of-focus issues, and video compression may create further artefacts. How to exploit the useful temporal information in video for effective text spotting also remains an unsolved challenge.

To reduce the cost (GPU computational expense) of algorithm research in the community, we select and annotate short sequences of around 15 seconds each, which contain dense text (around 23.5 text instances per frame) across 11 open real-life scenarios plus an "Unknown" scenario.
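
As a rough orientation, the per-frame density figure above is simply the mean number of annotated text instances over all annotated frames. Below is a minimal sketch of that computation, assuming a hypothetical annotation layout (one JSON file per video, mapping frame IDs to lists of text instances); the file layout and field names are illustrative, not the official DSText annotation format:

import json
from pathlib import Path

def average_texts_per_frame(ann_dir):
    # Mean number of text instances per annotated frame.
    # Hypothetical layout: one JSON file per video, each mapping
    # frame IDs to lists of text-instance records, e.g.
    # {"0": [{...}, {...}], "1": [{...}], ...}
    total_instances = 0
    total_frames = 0
    for ann_file in Path(ann_dir).glob("*.json"):
        with open(ann_file) as f:
            frames = json.load(f)
        for instances in frames.values():
            total_instances += len(instances)
            total_frames += 1
    return total_instances / max(total_frames, 1)

# Usage: average_texts_per_frame("annotations/") should come out to
# roughly 23.5 on a dataset with DSText's reported density.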

Date made available: 2 Feb 2023
Publisher: Computer Vision Center - Robust Reading Competition Portal
Date of data production: 2 Feb 2023
