Supporting Undotted Arabic with Pre-trained Language Models
Format: Journal Article
Language: English
Published: 18.11.2021
Summary: We observe a recent behaviour on social media in which users intentionally remove consonantal dots from Arabic letters in order to bypass content-classification algorithms. Content classification is typically done by fine-tuning pre-trained language models, which have recently been employed in many natural-language-processing applications. In this work we study the effect of applying pre-trained Arabic language models to "undotted" Arabic texts. We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing downstream tasks. The results are encouraging; on one of the tasks our method achieves nearly perfect performance.
DOI: 10.48550/arxiv.2111.09791