Improving the Performance of HDBSCAN on Short Text Clustering by Using Word Embedding and UMAP

Bibliographic Details
Published in: 2021 8th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA), pp. 1-6
Main Authors: Asyaky, Muhammad Sidik; Mandala, Rila
Format: Conference Proceeding
Language: English
Published: IEEE, 29.09.2021
Summary: Short text is one of the data formats commonly generated by people on social media, for instance, tweets on Twitter. Such texts are often used as data to analyze what is trending in a community. However, topic modeling and text clustering algorithms face unique problems on short text: sparsity, caused by many unique words appearing in only a few documents, and a lack of word co-occurrences, which makes it difficult for a system to extract semantic information about words. To overcome these two problems, we propose a novel method that uses word embedding techniques to represent documents in vector space. The FastText and BERT embedding models are chosen not only for the quality of their text representations but also for their ability to handle out-of-vocabulary words. For clustering, the HDBSCAN algorithm is used because of its ability to handle noise; however, it performs poorly when clustering high-dimensional data. Because the vectors produced by word embedding are high-dimensional, dimensionality reduction with UMAP is applied to the vectors before they are fed to HDBSCAN. The experimental results show that our method outperforms the baseline when evaluated on the purity and NMI metrics.
DOI: 10.1109/ICAICTA53211.2021.9640285