RESTORING TONE-MARKS IN STANDARD YORÙBÁ ELECTRONIC TEXT: IMPROVED MODEL
DOI:
https://doi.org/10.7494/csci.2017.18.3.2128
Keywords:
diacritic restoration, syllables, characters, word-level accuracy
Abstract
Diacritic restoration is a necessity in the processing of languages with Latin-based scripts that use letters outside the basic Latin alphabet used by English. Yorùbá is one such language, marking an underdot (dot-below) on three characters and tone marks on all seven vowels and two syllabic nasals. The problem of restoring underdotted characters has been fairly well addressed using characters as the linguistic unit for restoration. However, the existing character-based and word-based approaches have not been able to sufficiently address the restoration of tone marks in Yorùbá. In this study, we address tone-mark restoration as a subset of diacritic restoration. We propose using the syllable (derived from the word) as the linguistic token for tone-mark restoration. In our experimental setup, we used Yorùbá text collected from various sources, with a total of 250,336 words. On syllabification, these words yielded 464,274 syllables. The syllables were divided into training and testing data in different proportions, ranging from 99% for training and 1% for testing to 70% for training and 30% for testing. The aim of evaluating different proportions was to determine how the ratio of training to test data affects the variation in the results. We applied memory-based learning to train the models. We also set up a similar experiment using character tokens in order to compare performance.
The results showed that using syllables increased word-level accuracy to 96.23%, an average gain of almost 15% over the accuracy obtained using characters. We also found that using 75% of the data for training and the remaining 25% for testing gave the results with the least variation in a ten-fold cross-validation test. Hybridizing the syllable-based approach with other methods, such as lexicon lookup, may lead to further improvement over the current result.
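As a rough illustration of the task and of the word-level accuracy measure mentioned in the abstract, the Python sketch below strips tone marks from Yorùbá text while keeping the underdot, and scores a restored text against a reference word by word. It is not the authors' implementation: the function names `strip_tone_marks` and `word_accuracy` are hypothetical, and treating the combining grave, acute, and macron as the set of tone marks is an assumption made for this example.

```python
import unicodedata

# Assumption for illustration: these combining marks are the tone marks.
# U+0300 grave (low tone), U+0301 acute (high tone), U+0304 macron (mid tone,
# where it is written at all). The underdot U+0323 is deliberately not listed.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_tone_marks(text: str) -> str:
    """Remove tone marks but keep underdotted characters intact."""
    # NFD splits each letter into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    # NFC recomposes what remains (for example o + underdot).
    return unicodedata.normalize("NFC", kept)


def word_accuracy(reference: str, restored: str) -> float:
    """Fraction of whitespace-separated words restored exactly."""
    ref_words = unicodedata.normalize("NFC", reference).split()
    out_words = unicodedata.normalize("NFC", restored).split()
    if len(ref_words) != len(out_words):
        raise ValueError("reference and restored text differ in word count")
    if not ref_words:
        return 0.0
    correct = sum(r == o for r, o in zip(ref_words, out_words))
    return correct / len(ref_words)


if __name__ == "__main__":
    print(strip_tone_marks("Yorùbá"))  # prints "Yoruba"
    print(word_accuracy("a b c d", "a b c x"))  # prints 0.75
```

A restoration system would take the output of `strip_tone_marks` as its input and try to recover the original text; `word_accuracy` then counts a word as correct only if every tone mark in it has been restored.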