Testing word similarity: Language independent approach with examples from Romance

Mikhail Alexandrov, Xavier Blanco, Pavel Makagonov

Research output: Contribution to journal; Article; Research; peer-review

2 Citations (Scopus)

Abstract

Identifying words with the same basic meaning (stemming) has important applications in Information Retrieval, above all in constructing word frequency lists. The usual morphology-based approaches (including the Porter stemmers) rely on language-dependent linguistic resources or knowledge, which causes problems when working with multilingual data and multithematic document collections. We propose several empirical formulae with easy-to-adjust parameters and demonstrate how to construct such formulae for a given language using an inductive method of model self-organization. This method considers a set of models (formulae) of a given class and selects the best ones using training and test samples. We describe the method and give detailed examples for French, Italian, Portuguese, and Spanish. The formulae are evaluated on real domain-oriented document collections. Our approach can be readily applied to other European languages. © Springer-Verlag 2004.
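The selection step described in the abstract (consider a family of formulae, fit each on a training sample, keep the one that performs best on a test sample) can be sketched roughly as follows. The abstract does not state the formulae themselves, so everything concrete here is an assumption made for illustration: the decision rule (share of non-matching final letters compared with a polynomial of the common-prefix length), the coefficient grid, the function names, and the hand-made Spanish word pairs. It is a minimal sketch of the general train/test selection idea, not the authors' procedure.

```python
from itertools import product


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading letters that two words share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def similar(a: str, b: str, coef) -> bool:
    """Hypothetical rule (an assumption, not taken from the paper):
    two words count as similar when the share of letters outside the
    common prefix does not exceed a polynomial of the prefix length."""
    total = len(a) + len(b)
    if total == 0:
        return True  # two empty strings are trivially identical
    y = common_prefix_len(a, b)
    differing = total - 2 * y
    threshold = sum(c * y**k for k, c in enumerate(coef))
    return differing / total <= threshold


def error_rate(pairs, coef) -> float:
    """Fraction of labelled pairs that the rule classifies wrongly."""
    wrong = sum(similar(a, b, coef) != label for a, b, label in pairs)
    return wrong / len(pairs)


def select_model(train, test, grid, max_degree=1):
    """For each polynomial degree, pick the coefficients with the lowest
    training error, then keep the degree whose fitted model has the
    lowest error on the held-out test sample."""
    best = None
    for degree in range(max_degree + 1):
        candidates = product(grid, repeat=degree + 1)
        fitted = min(candidates, key=lambda c: error_rate(train, c))
        score = error_rate(test, fitted)
        if best is None or score < best[0]:
            best = (score, fitted)
    return best


# Toy, hand-labelled pairs (illustrative only, not data from the paper).
TRAIN = [
    ("cantar", "cantamos", True),
    ("casa", "casas", True),
    ("libro", "libros", True),
    ("casa", "caso", False),
    ("mesa", "masa", False),
    ("perro", "pera", False),
]
TEST = [
    ("hablar", "hablamos", True),
    ("ciudad", "ciudades", True),
    ("pato", "pata", False),
    ("sol", "sal", False),
]
GRID = [-0.1, 0.0, 0.1, 0.2, 0.3]  # assumed coefficient grid

if __name__ == "__main__":
    score, coef = select_model(TRAIN, TEST, GRID)
    print("selected coefficients:", coef, "test error:", score)
```

Scoring each fitted candidate on a sample it was not fitted on is what keeps the more flexible formulae from being preferred merely because they match the training pairs more closely.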
Original language: English
Pages (from-to): 229-241
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3136
Publication status: Published - 1 Dec 2004
