TY - CHAP
T1 - Filtering and rescoring the CCMatrix corpus for Neural Machine Translation training
AU - Oliver, Antoni
AU - Álvarez, Sergi
N1 - Publisher Copyright:
© 2023 The authors. This article is licensed under a Creative Commons 4.0 licence, no derivative works, attribution, CC-BY-ND.
PY - 2023
Y1 - 2023
N2 - There are several parallel corpora available for many language pairs, such as CCMatrix, built from mass downloads of web content and automatic detection of segments in one language and the translation equivalent in another. These techniques can produce large parallel corpora, but of questionable quality. In many cases, the segments are not in the required languages, or if they are, they are not translation equivalents. In this article, we present an algorithm for filtering out the segments in languages other than the required ones and re-scoring the segments using SBERT. A use case on the Spanish–Asturian and Spanish–Catalan CCMatrix corpus is presented.
UR - http://www.scopus.com/inward/record.url?scp=85184805763&partnerID=8YFLogxK
M3 - Chapter
AN - SCOPUS:85184805763
T3 - Proceedings of the 24th Annual Conference of the European Association for Machine Translation, EAMT 2023
SP - 39
EP - 45
BT - Proceedings of the 24th Annual Conference of the European Association for Machine Translation, EAMT 2023
A2 - Nurminen, Mary
A2 - Brenner, Judith
A2 - Koponen, Maarit
A2 - Latomaa, Sirkku
A2 - Mikhailov, Mikhail
A2 - Schierl, Frederike
A2 - Ranasinghe, Tharindu
A2 - Vanmassenhove, Eva
A2 - Vidal, Sergi Alvarez
A2 - Aranberri, Nora
A2 - Nunziatini, Mara
A2 - Escartin, Carla Parra
A2 - Forcada, Mikel
A2 - Popovic, Maja
A2 - Scarton, Carolina
A2 - Moniz, Helena
PB - European Association for Machine Translation
ER -