Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG)

Bibliographic Details
Published in Discover Computing Vol. 27; no. 1; p. 17
Main Authors Agyei, Emmanuel; Zhang, Xiaoling; Bannerman, Stephen; Quaye, Ama Bonuah; Yussi, Sophyani Banaamwini; Agbesi, Victor Kwaku
Format Journal Article
Language English
Published Dordrecht Springer Netherlands 05.07.2024
Springer Nature B.V
ISSN 2948-2992
1386-4564
1573-7659
DOI 10.1007/s10791-024-09451-8

Summary: Although Ghana has no single common language for its citizens, the Twi dialect stands a chance of fulfilling this role. Twi is a low-resourced language, yet it is widely spoken beyond Ghana, in countries such as the Ivory Coast, Benin, and Nigeria, among other places. However, it continues to be seen as the perfect resource for Twi Machine Translation (MT) of ISO 639-3. The shortage of Twi-English parallel corpora is most evident at the multi-domain dataset level, partly due to the complex design structure and the scarcity of a digital Twi lexicon. This study introduced Twi-2-ENG, a large-scale multi-domain Twi-to-English parallel corpus, together with a digital Twi dictionary and a Twi lexicon. It drew on the Ghanaian Parliamentary Hansards, crowdsourcing, and digital Ghanaian news portals to collect the English sentences. The crawled news portals contributed 5,765 parallel sentences, alongside material from the Twi New Testament Bible and social media platforms. The data-gathering methods (translation, compilation, tokenization, and final alignment of the Twi-English parallel sentences), along with the technology used to compile and host the corpus, were discussed. The results show that qualified linguists and Twi translation specialists from the media, academia, and well-wishers contributed a considerable volume to the Twi-2-ENG parallel corpus. Finally, all the sentences were curated with the help of the Sketch Engine corpus manager, linguists, and professional translators to align and tokenize the texts, allowing professional Twi linguists to evaluate the corpus.