A Survey of Current Datasets for Code-Switching Research

Bibliographic Details
Published in	2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 136-141
Main Authors	Jose, Navya; Chakravarthi, Bharathi Raja; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P.
Format	Conference Proceeding
Language	English
Published	IEEE 01.03.2020
Subjects	code switching; dataset; Measurement; Natural language processing; Social network services; Switches; Tagging; Task analysis; Vocabulary

More Information
Summary:	Code switching is prevalent in multilingual communities and in social media interaction. Over the past ten years, the volume of code-switched data on social media has exploded, mixing low-resourced and high-resourced languages in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation. The available corpora for code-switching research have played a major role in advancing this area. In this paper, we propose a set of quality metrics to evaluate these datasets and categorize them accordingly.
ISBN:	1728151961 9781728151960
ISSN:	2575-7288
DOI:	10.1109/ICACCS48705.2020.9074205