DIACRITIC-AWARE YORÙBÁ SPELL CHECKER
DOI:
https://doi.org/10.7494/csci.2023.24.1.4494Abstract
Spell checking and correction is still in infancy for Yorùbá language, and existing tools cannot be applied directly to address the problem because Yorùbá language requires extensive use of diacritics for marking phonemes and tone. We addressed this problem by collecting data from on-line sources and from optical character recognition of hard copy of books. The features relevant to spell checking and correction in this language that marks tones (and underdot) were identified through the review of existing spell checking solutions, analysis of the data collected and consultation with relevant Yorùbá Linguistics textbooks. A conceptual model was formulated as a parallel combination of a unigram language model and a language diacritic model to form a dictionary sub-model that is used by Error Detection and Candidate Generation modules. The candidate generation module was implemented as an inverse Levensthein edit-distance algorithm.
The system was evaluated using Detection Accuracy (calculated from Precision and Recall) and Suggestion Accuracy (SA) as metrics.Our experimental setups compared the performance of the component subsystems when used alone with the their combination into a unified model. The detection accuracies for each of the models were 93.23%, 94.10% and 95.01% respectively while their suggestion accuracies were 26.94%, 72.10% and 65.89% respectively. In relation to the size of training corpus, the unified model was able to achieve a increase of 1.83% in detection accuracy and 5.27% in suggestion accuracy for 70% increase in size of corpus. The results indicated that each of the sub-models in the dictionary played different roles while the increase in training data does not give a linear proportional increase in performance of the spell checker. The study also showed that spell checking a Yorùbá text was better when attention is paid to the diacritical aspect of the language
Downloads
Downloads
Published
Issue
Section
License
Copyright (c) 2023 Computer Science
This work is licensed under a Creative Commons Attribution 4.0 International License.