Diacritic-aware Yorùbá spell checker

Franklin Oladiipo Asahiah; Mary Taiwo Onifade; Abayomi Emmanuel Adegunlehin; Adekemisola Olufunmilayo Asahiah; Adekemi Olawunmi Amoo

doi:10.7494/csci.2023.24.1.4494

Authors

Franklin Oladiipo Asahiah Obafemi Awolowo University,
Mary Taiwo Onifade Obafemi Awolowo University, Ile-Ife
Abayomi Emmanuel Adegunlehin Obafemi Awolowo University, Ile-Ife
Adekemisola Olufunmilayo Asahiah Obafemi Awolowo University, Ile-Ife
Adekemi Olawunmi Amoo

Abstract

Spell checking and correction are still in their infancy for the Yorùbá language, and existing tools cannot be applied directly because Yorùbá relies extensively on diacritics to mark phonemes and tone. We addressed this problem by collecting data from online sources and from optical character recognition of printed books. The features relevant to spell checking and correction in this language, which marks tones and uses the underdot, were identified through a review of existing spell-checking solutions, analysis of the collected data, and consultation of relevant Yorùbá linguistics textbooks. A conceptual model was formulated as a parallel combination of a unigram language model and a language diacritic model, forming a dictionary sub-model that is used by the Error Detection and Candidate Generation modules. The Candidate Generation module was implemented as an inverse Levenshtein edit-distance algorithm.
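The dictionary sub-model and candidate generation step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that "inverse Levenshtein" means generating every string one edit away from the input and keeping those found in the dictionary, and that the diacritic model can be approximated by an index from diacritic-stripped forms to their attested diacritized spellings. The toy corpus, the alphabet and all function names are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict

def nfc(s):
    return unicodedata.normalize("NFC", s)

def nfd(s):
    return unicodedata.normalize("NFD", s)

def strip_marks(word):
    """Remove tone marks and underdots, leaving only base letters."""
    return "".join(ch for ch in nfd(word) if not unicodedata.combining(ch))

# Hypothetical toy corpus; a real system would be trained on a large corpus.
CORPUS = "ọmọ ọmọ ọkọ ọkọ ọkọ́ ilé ilé ilé ìlú owó".split()

# Unigram model: word frequencies (assumed stand-in for the language model).
UNIGRAMS = Counter(nfc(w) for w in CORPUS)

# Diacritic model (assumed form): stripped spelling -> diacritized variants.
DIACRITIC_INDEX = defaultdict(set)
for w in UNIGRAMS:
    DIACRITIC_INDEX[strip_marks(w)].add(w)

BASE_LETTERS = "abdefghijklmnoprstuwy"
MARKS = "\u0300\u0301\u0323"  # combining grave, acute, dot below
ALPHABET = BASE_LETTERS + MARKS

def edits1(word):
    """All strings one Levenshtein edit away, computed on the decomposed
    form so that a missing or wrong diacritic counts as a single edit."""
    s = nfd(word)
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return {nfc(e) for e in deletes + replaces + inserts}

def is_error(word):
    """Error detection: a word is flagged if the dictionary lacks it."""
    return nfc(word) not in UNIGRAMS

def candidates(word):
    """Suggestions from both sub-models, ranked by unigram frequency."""
    word = nfc(word)
    if word in UNIGRAMS:
        return [word]
    pool = {e for e in edits1(word) if e in UNIGRAMS}
    pool |= DIACRITIC_INDEX.get(strip_marks(word), set())
    return sorted(pool, key=lambda w: (-UNIGRAMS[w], w))
```

In this sketch, `candidates("omo")` finds nothing within one edit, because both underdots are missing, yet the diacritic index still recovers `ọmọ`. This is one way the two sub-models could complement each other.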

The system was evaluated using Detection Accuracy (calculated from Precision and Recall) and Suggestion Accuracy (SA) as metrics. Our experimental setups compared the performance of the component sub-models when used alone with that of their combination into a unified model. The detection accuracies of the models were 93.23%, 94.10% and 95.01% respectively, while their suggestion accuracies were 26.94%, 72.10% and 65.89% respectively. With respect to the size of the training corpus, the unified model achieved an increase of 1.83% in detection accuracy and 5.27% in suggestion accuracy for a 70% increase in corpus size. The results indicated that each of the sub-models in the dictionary played a different role, and that increasing the training data does not yield a linearly proportional increase in the performance of the spell checker. The study also showed that spell checking of Yorùbá text is better when attention is paid to the diacritical aspect of the language.
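To make the evaluation metrics concrete, the following Python sketch computes precision and recall for error detection, together with a suggestion accuracy. The abstract does not state how Detection Accuracy is derived from Precision and Recall, so the harmonic mean (F1) used here is only an assumption. Suggestion accuracy is likewise assumed to be the share of true misspellings whose intended word appears among the suggestions. All names and the toy data are hypothetical.

```python
def detection_metrics(flagged, actual_errors):
    """Precision, recall and their harmonic mean for error detection.

    flagged: token positions the checker marked as misspelled.
    actual_errors: token positions that really are misspelled.
    """
    flagged, actual = set(flagged), set(actual_errors)
    true_positives = len(flagged & actual)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    # Assumption: F1 is one common way to combine precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def suggestion_accuracy(suggestions, gold):
    """Share of misspellings whose intended word is among the suggestions.

    suggestions: position -> list of suggested words.
    gold: position -> intended (correct) word.
    """
    if not gold:
        return 0.0
    hits = sum(
        1 for pos, target in gold.items()
        if target in suggestions.get(pos, [])
    )
    return hits / len(gold)

# Toy example: 4 tokens flagged, 5 tokens truly misspelled, 3 in common.
precision, recall, f1 = detection_metrics({1, 3, 5, 7}, {1, 3, 4, 7, 9})

gold = {1: "ilé", 3: "ọmọ", 4: "ìlú", 7: "owó"}
suggested = {1: ["ilé"], 3: ["ọkọ"], 7: ["owó", "òwò"]}
sa = suggestion_accuracy(suggested, gold)
```

Under these definitions a checker can detect errors well yet still suggest poorly, which matches the gap between the reported detection and suggestion figures.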

Author Biography

Franklin Oladiipo Asahiah, Obafemi Awolowo University,

Asahiah Franklin Oladiipo is a researcher in the fields of human language processing and engineering. He enjoys exploring the application of machine intelligence to problem-solving. He is a Senior Lecturer and has several publications to his credit.