Text Features Extraction based on TF-IDF Associating Semantic
The TF-IDF (term frequency-inverse document frequency) algorithm extracts text features from word statistics. It treats words as the same only when their surface forms (for example, their ASCII strings) are identical across texts, and does not consider that a word may be represented by its synonyms. Separating words wi...
Published in | 2018 IEEE 4th International Conference on Computer and Communications (ICCC) pp. 2338 - 2343 |
---|---|
Main Authors | Liu, Qing; Wang, Jing; Zhang, Dehai; Yang, Yun; Wang, NaiYao |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.12.2018 |
Subjects | |
DOI | 10.1109/CompComm.2018.8780663 |
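
The abstract in this record outlines a pipeline: score words with TF-IDF, drop low-scoring words, cluster the remaining words by word-vector similarity, then run TF-IDF again with clusters as the feature units of a vector space model. The following is a minimal sketch of that flow under stated assumptions, not the authors' implementation. The toy documents, the hand-written 3-d vectors (standing in for vectors that the paper trains with word2vec), the thresholds `min_score` and `eps`, and the simplified DBSCAN-style routine `density_cluster` are all hypothetical; the abstract does not say which density clustering algorithm or parameters were used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def density_cluster(words, vectors, eps=0.95, min_pts=2):
    """Simplified DBSCAN-style clustering (an assumption, not the paper's algorithm).

    Two words are neighbours when their cosine similarity is >= eps.
    Words that end up in no dense region become singleton clusters.
    """
    neighbours = {
        w: [x for x in words if cosine(vectors[w], vectors[x]) >= eps] for w in words
    }
    label, next_id = {}, 0
    for w in words:
        if w in label or len(neighbours[w]) < min_pts:
            continue  # already assigned, or not a core word
        label[w] = next_id
        frontier = list(neighbours[w])
        while frontier:
            x = frontier.pop()
            if x in label:
                continue
            label[x] = next_id
            if len(neighbours[x]) >= min_pts:  # expand only through core words
                frontier.extend(neighbours[x])
        next_id += 1
    for w in words:  # leftover (noise) words keep their own feature
        if w not in label:
            label[w] = next_id
            next_id += 1
    return label

# Hypothetical toy corpus: "the" and "is" occur everywhere, so their TF-IDF is 0.
docs = [
    ["the", "car", "is", "fast"],
    ["the", "automobile", "is", "quick"],
    ["the", "apple", "is", "sweet"],
]
# Hypothetical hand-made vectors standing in for trained word2vec vectors.
vectors = {
    "car": (1.0, 0.0, 0.0), "automobile": (0.98, 0.05, 0.0),
    "fast": (0.0, 1.0, 0.0), "quick": (0.05, 0.98, 0.0),
    "apple": (0.0, 0.0, 1.0), "sweet": (0.3, 0.3, 0.9),
}

# Step 1: first TF-IDF pass; keep words whose best score exceeds min_score.
min_score = 0.1
best = Counter()
for weights in tf_idf(docs):
    for term, score in weights.items():
        best[term] = max(best[term], score)
kept = sorted(t for t, s in best.items() if s > min_score and t in vectors)

# Step 2: cluster the kept words by word-vector similarity.
label = density_cluster(kept, vectors)

# Step 3: rewrite each document as cluster ids and run TF-IDF again (the VSM).
cluster_docs = [[label[t] for t in doc if t in label] for doc in docs]
vsm = tf_idf(cluster_docs)
```

In this toy run the synonym pairs share a cluster, so the first two documents receive identical feature vectors even though they have no content word in common, which is the effect the abstract attributes to using clusters as feature units.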
Abstract | The TF-IDF (term frequency-inverse document frequency) algorithm extracts text features from word statistics. It treats words as the same only when their surface forms (for example, their ASCII strings) are identical across texts, and does not consider that a word may be represented by its synonyms. Treating words with the same or similar meanings as separate features loses part of the information during text feature extraction. Word representations therefore need to capture the similarity between words, and that similarity must be derived from the meanings of the words in the texts. To improve the accuracy of text feature extraction, this paper uses the word2vec model to train word vectors on the corpus and obtain their semantic features. After words with low TF-IDF values are excluded, a density clustering algorithm clusters the remaining words by word-vector similarity, so that similar words are grouped together and can represent one another. Experiments show that applying the TF-IDF algorithm again and constructing a VSM (vector space model) with these clusters as feature units can effectively improve the accuracy of text feature extraction. |
---|---|
Author | Liu, Qing; Zhang, Dehai; Wang, NaiYao; Yang, Yun; Wang, Jing |
Author_xml | – sequence: 1 givenname: Qing surname: Liu fullname: Liu, Qing organization: Software College, Yunnan University, Kunming, 650091, China – sequence: 2 givenname: Jing surname: Wang fullname: Wang, Jing organization: Software College, Yunnan University, Kunming, 650091, China – sequence: 3 givenname: Dehai surname: Zhang fullname: Zhang, Dehai organization: Software College, Yunnan University, Kunming, 650091, China – sequence: 4 givenname: Yun surname: Yang fullname: Yang, Yun organization: Software College, Yunnan University, Kunming, 650091, China – sequence: 5 givenname: NaiYao surname: Wang fullname: Wang, NaiYao organization: Software College, Yunnan University, Kunming, 650091, China |
ContentType | Conference Proceeding |
EISBN | 1538683393 9781538683392 |
EndPage | 2343 |
ExternalDocumentID | 8780663 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
PageCount | 6 |
ParticipantIDs | ieee_primary_8780663 |
PublicationCentury | 2000 |
PublicationDate | 2018-Dec. |
PublicationDateYYYYMMDD | 2018-12-01 |
PublicationDate_xml | – month: 12 year: 2018 text: 2018-Dec. |
PublicationDecade | 2010 |
PublicationTitle | 2018 IEEE 4th International Conference on Computer and Communications (ICCC) |
PublicationTitleAbbrev | CompComm |
PublicationYear | 2018 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
StartPage | 2338 |
SubjectTerms | Clustering; Clustering algorithms; Data mining; Feature extraction; Mathematical model; Semantic features; Semantics; Software; Text feature; TF-IDF; Training; Word vector |
Title | Text Features Extraction based on TF-IDF Associating Semantic |
URI | https://ieeexplore.ieee.org/document/8780663 |