Text Features Extraction based on TF-IDF Associating Semantic

The TF-IDF (term frequency-inverse document frequency) algorithm extracts text features from word statistics. It considers only words that have the same surface form across texts (for example, the same ASCII string) and ignores the fact that a word may also be represented by its synonyms. Separating words wi...

Bibliographic Details
Published in	2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 2338-2343
Main Authors	Liu, Qing; Wang, Jing; Zhang, Dehai; Yang, Yun; Wang, NaiYao
Format	Conference Proceeding
Language	English
Published	IEEE 01.12.2018
Subjects	Clustering; Clustering algorithms; Data mining; Feature extraction; Mathematical model; Semantic features; Semantics; Software; Text feature; TF-IDF; Training; Word vector
DOI	10.1109/CompComm.2018.8780663

Abstract	The TF-IDF (term frequency-inverse document frequency) algorithm extracts text features from word statistics. It considers only words that have the same surface form across texts (for example, the same ASCII string) and ignores the fact that a word may also be represented by its synonyms. Treating words with the same or similar meanings as separate features loses part of the information when text features are extracted. Representing words therefore requires capturing the similarity among words, and that similarity must be derived from the meanings of the words in the texts. To improve the accuracy of text feature extraction, this paper uses the word2vec model to train word vectors on the corpus and obtain their semantic features. After words with low TF-IDF values are excluded, a density clustering algorithm clusters the remaining words according to word vector similarity. As a result, similar words are clustered together and can represent one another. Experiments show that applying the TF-IDF algorithm again and constructing a VSM (vector space model) with these clusters as feature units can effectively improve the accuracy of text feature extraction.
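The pipeline described in the abstract (TF-IDF filtering, density clustering of word vectors, then a second TF-IDF pass over clusters) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `tf * log(N / df)` weighting, the DBSCAN-style clustering on cosine similarity, the thresholds, the function names, and the hand-written `toy_vectors` (hypothetical stand-ins for trained word2vec embeddings) are all assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter

def tf_idf(docs):
    """One {term: weight} dict per document, weight = tf * log(N / df) (assumed variant)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

def keep_informative(word_weights, threshold):
    """Words whose best TF-IDF weight in any document reaches the threshold."""
    best = {}
    for doc_weights in word_weights:
        for term, w in doc_weights.items():
            best[term] = max(best.get(term, 0.0), w)
    return sorted(t for t, w in best.items() if w >= threshold)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def density_cluster(vectors, min_sim=0.95, min_pts=2):
    """DBSCAN-style clustering on cosine similarity; returns {word: cluster_id}."""
    words = list(vectors)
    near = {w: [x for x in words if cosine(vectors[w], vectors[x]) >= min_sim]
            for w in words}
    labels, next_id = {}, 0
    for w in words:
        if w in labels or len(near[w]) < min_pts:
            continue  # already assigned, or not a core point
        labels[w] = next_id
        frontier = [w]
        while frontier:  # grow the cluster through density-reachable words
            for x in near[frontier.pop()]:
                if x not in labels:
                    labels[x] = next_id
                    if len(near[x]) >= min_pts:
                        frontier.append(x)
        next_id += 1
    for w in words:  # words left as noise each become their own feature unit
        if w not in labels:
            labels[w] = next_id
            next_id += 1
    return labels

# Toy corpus; "the" occurs in every document, so its TF-IDF weight is 0.
docs = [["the", "car", "engine"],
        ["the", "automobile", "engine"],
        ["the", "banana", "fruit"]]

# Hypothetical 2-D vectors standing in for word2vec output.
toy_vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "engine": [0.6, 0.8],
               "banana": [0.0, 1.0], "fruit": [0.1, 0.9]}

word_weights = tf_idf(docs)                               # first TF-IDF pass
kept = keep_informative(word_weights, threshold=0.05)     # drop low-TF-IDF words
labels = density_cluster({w: toy_vectors[w] for w in kept})
cluster_docs = [[labels[w] for w in doc if w in labels] for doc in docs]
cluster_weights = tf_idf(cluster_docs)                    # second pass: clusters as features
```

In this toy setup, "car" and "automobile" fall into one cluster, so the first two documents receive identical cluster-level vectors even though they share no synonym at the word level. In practice the vectors would come from a trained embedding model and the clustering from a library implementation.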
Author Affiliations	Liu, Qing; Wang, Jing; Zhang, Dehai; Yang, Yun; Wang, NaiYao: Software College, Yunnan University, Kunming, 650091, China
EISBN	1538683393 9781538683392
URI	https://ieeexplore.ieee.org/document/8780663