Noisy-parallel and comparable corpora filtering methodology for the extraction of bi-lingual equivalent data at sentence level

Krzysztof Wołk


Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems, some NLP tools, and any other text processing task that requires bilingual data. This research proposes a language-independent bisentence filtering approach, developed through Polish-to-English experiments (Polish is not a position-sensitive language). The cleaning approach was developed on the TED Talks corpus and initially tested on the Wikipedia comparable corpus, but it can be applied to any text domain or language pair. It implements several heuristics for sentence comparison, some of which use synonyms as well as semantic and structural analysis of the text as additional information. The approach is designed to minimize data loss. The improvement in MT system scores obtained with text processed by this tool is discussed.
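
To make the idea of heuristic bisentence filtering concrete, the following is a minimal sketch and not the method described in the paper. It assumes a single illustrative heuristic: a sentence pair is kept when enough source words have a dictionary equivalent or synonym in the target sentence. The toy lexicon, the function names (`tokenize`, `overlap_score`, `filter_pairs`), and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy lexicon: Polish word -> acceptable English equivalents,
# including synonyms. A real system would load a much larger resource.
LEXICON = {
    "kot": {"cat"},
    "pies": {"dog", "hound"},
    "lubi": {"likes", "enjoys"},
    "mleko": {"milk"},
}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def overlap_score(pl_sentence, en_sentence, lexicon=LEXICON):
    """Fraction of Polish tokens that have a lexicon equivalent in the English sentence."""
    pl_tokens = tokenize(pl_sentence)
    en_tokens = set(tokenize(en_sentence))
    if not pl_tokens:
        return 0.0
    hits = sum(1 for tok in pl_tokens if lexicon.get(tok, set()) & en_tokens)
    return hits / len(pl_tokens)


def filter_pairs(pairs, threshold=0.5):
    """Keep only the sentence pairs whose overlap score reaches the (assumed) threshold."""
    return [(pl, en) for pl, en in pairs if overlap_score(pl, en) >= threshold]


pairs = [
    ("Kot lubi mleko", "The cat enjoys milk"),        # plausible translation
    ("Pies lubi mleko", "Stock prices fell sharply"),  # misaligned pair
]
kept = filter_pairs(pairs)
```

In this sketch the first pair is kept because "enjoys" is listed as a synonym of "likes", while the misaligned second pair is discarded. A lower threshold would reduce data loss at the cost of admitting more noisy pairs.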


statistical machine translation; NLP; comparable corpora; text filtering






