Comparison and Adaptation of Automatic Evaluation Metrics for Quality Assessment of Re-Speaking

Krzysztof Wołk; Danijel Koržinek

doi:10.7494/csci.2017.18.2.129

Authors

Krzysztof Wołk, Polish-Japanese Academy of Information Technology
Danijel Koržinek, Polish-Japanese Academy of Information Technology

Keywords:

evaluation metrics

Abstract

Re-speaking is a mechanism for obtaining high-quality subtitles for live
broadcasts and other public events. Because it relies on humans performing the
actual re-speaking, estimating the quality of the results is non-trivial.
Most organisations rely on humans to perform the quality assessment,
but purely automatic methods have been developed for similar problems,
such as Machine Translation. This paper compares several of these
methods: BLEU, EBLEU, NIST, METEOR, METEOR-PL, TER and RIBES.
These are then matched against the human-derived NER metric, commonly used
in re-speaking.
