A Survey of Current Datasets for Code-Switching Research

Bibliographic Details
Published in: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 136-141
Main Authors: Jose, Navya; Chakravarthi, Bharathi Raja; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P.
Format: Conference Proceeding
Language: English
Published: IEEE, 01.03.2020
Summary: Code-switching is a prevalent phenomenon in multilingual communities and in social media interaction. Over the past ten years, we have witnessed an explosion of code-switched data on social media that brings together languages, ranging from low-resourced to high-resourced, in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation. The available corpora for code-switching research have played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate these datasets and categorize them accordingly.
ISBN: 1728151961, 9781728151960
ISSN: 2575-7288
DOI: 10.1109/ICACCS48705.2020.9074205