Filtering and rescoring the CCMatrix corpus for Neural Machine Translation training

Antoni Oliver, Sergi Álvarez

Research output: Chapter in book, Chapter, Research, Peer-reviewed

2 Citations (Scopus)

Abstract

There are several parallel corpora available for many language pairs, such as CCMatrix, built from mass downloads of web content and automatic detection of segments in one language and the translation equivalent in another. These techniques can produce large parallel corpora, but of questionable quality. In many cases, the segments are not in the required languages, or if they are, they are not translation equivalents. In this article, we present an algorithm for filtering out the segments in languages other than the required ones and re-scoring the segments using SBERT. A use case on the Spanish–Asturian and Spanish–Catalan CCMatrix corpus is presented.
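
The abstract names two steps, language filtering and SBERT-based rescoring, without giving details. A minimal sketch of how such a pipeline might be wired together is shown below. It assumes a pluggable language detector and a pluggable sentence encoder, and it assumes cosine similarity as the score; the names `filter_and_rescore`, `detect_lang`, `embed`, and `min_score` are hypothetical and are not taken from the paper.

```python
import math
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_and_rescore(
    pairs: Iterable[Tuple[str, str]],
    detect_lang: Callable[[str], str],   # hypothetical: returns a language code
    embed: Callable[[str], Vector],      # hypothetical: multilingual sentence encoder
    src_lang: str,
    tgt_lang: str,
    min_score: float = 0.0,              # assumed threshold, not from the paper
) -> Iterator[Tuple[str, str, float]]:
    """Yield (source, target, score) for pairs that pass both steps."""
    for src, tgt in pairs:
        # Step 1: drop pairs whose segments are not in the required languages.
        if detect_lang(src) != src_lang or detect_lang(tgt) != tgt_lang:
            continue
        # Step 2: rescore the pair by embedding similarity (assumed: cosine).
        score = cosine(embed(src), embed(tgt))
        if score >= min_score:
            yield src, tgt, score
```

In practice, `embed` could wrap a multilingual SBERT model and `detect_lang` a language identification tool; both are left as parameters here so that the sketch does not commit to any particular library.
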
Original language: English
Title of host publication: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, EAMT 2023
Editors: Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartin, Mikel Forcada, Maja Popovic, Carolina Scarton, Helena Moniz
Publisher: European Association for Machine Translation
Pages: 39-45
Number of pages: 7
ISBN (electronic): 9789520329471
Publication status: Published - 2023

Publication series

Name: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, EAMT 2023
