Word-Wise Thai and Roman Script Identification

Sukalpa Chanda, Umapada Pal, Oriol Ramos Terrades

Research output: Contribution to journal › Article › Research › peer-review

11 Citations (Scopus)

Abstract

In some Thai documents, a single text line of a printed page may contain words in both Thai and Roman scripts. For Optical Character Recognition (OCR) of such a page, it is better to first identify the Thai and Roman script portions and then apply the OCR system of the respective script to each identified portion. In this article, an SVM-based method is proposed for word-wise identification of printed Roman and Thai scripts within a single line of a document page. The document is first segmented into lines, and the lines are then segmented into character groups (words). In the proposed scheme, the script of a character group is identified by combining different character features obtained from structural shape, profile behavior, component overlapping information, topological properties, and the water reservoir concept. In an experiment on 10,000 words, the proposed scheme achieved 99.62% script identification accuracy.
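
The segmentation step described in the abstract (page to lines, lines to character groups) can be illustrated with a minimal sketch. The abstract does not say how the segmentation is performed; the projection-profile approach below is an assumption chosen because it is a common technique for printed text, and the function names (`runs`, `segment_lines`, `segment_words`) and the `min_gap` parameter are hypothetical. Feature extraction and the SVM classifier are not shown.

```python
# Illustrative only: projection-profile segmentation of a binarized page.
# The page is a list of rows; 1 = ink pixel, 0 = background.
# This is an assumed technique, not the method confirmed by the article.

def runs(profile, min_gap=1):
    """Return (start, end) pairs of non-zero runs in a profile.

    Runs separated by fewer than min_gap zero entries are merged,
    so small gaps between characters do not split a word.
    """
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


def segment_lines(img):
    """Split a page into text lines using the horizontal projection."""
    row_profile = [sum(row) for row in img]
    return runs(row_profile)


def segment_words(img, top, bottom, min_gap=2):
    """Split one text line (rows top..bottom-1) into character groups
    using the vertical projection; min_gap is a hypothetical threshold."""
    width = len(img[0])
    col_profile = [
        sum(img[r][c] for r in range(top, bottom)) for c in range(width)
    ]
    return runs(col_profile, min_gap)


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 0, 1, 0, 0, 1, 1],
        [1, 0, 0, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0, 0, 0],
    ]
    for top, bottom in segment_lines(page):
        print((top, bottom), segment_words(page, top, bottom))
```

Each character group returned by such a step would then be passed to feature extraction and script classification. In practice the gap threshold would need tuning per font size and resolution.
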
Original language: English
Article number: 11
Number of pages: 21
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 8
Issue number: 3
Publication status: Published - 1 Aug 2009
