Text Features Extraction based on TF-IDF Associating Semantic

Bibliographic Details
Published in: 2018 IEEE 4th International Conference on Computer and Communications (ICCC), pp. 2338-2343
Main Authors: Liu, Qing; Wang, Jing; Zhang, Dehai; Yang, Yun; Wang, NaiYao
Format: Conference Proceeding
Language: English
Published: IEEE, 2018-12-01
DOI: 10.1109/CompComm.2018.8780663

Abstract: The TF-IDF (term frequency-inverse document frequency) algorithm extracts text features from word statistics. It considers only the surface form of a word, which is the same in every text (for example, its ASCII encoding), and ignores the fact that the same meaning can be expressed by synonyms. Treating words with the same or similar meanings as separate features loses part of the information when text features are extracted. Representing words therefore requires a measure of word similarity, and that similarity has to be derived from the meaning of the words in the texts. To improve the accuracy of text feature extraction, this paper uses the word2vec model to train word vectors on the corpus and obtain their semantic features. After words with low TF-IDF values are excluded, a density clustering algorithm clusters the remaining words according to word-vector similarity, so that similar words are grouped together and can represent one another. Experiments show that applying the TF-IDF algorithm again and constructing a VSM (vector space model) with these clusters as feature units can effectively improve the accuracy of text feature extraction.
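The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`tf_idf`, `density_cluster`, `cluster_features`) are hypothetical, the two-dimensional word vectors are hand-made stand-ins for trained word2vec vectors, the clustering is a simple DBSCAN-style procedure on cosine similarity (the abstract does not name the density algorithm), and the rule "keep a word if its TF-IDF exceeds a threshold in at least one document" is an assumption.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {feature: tf-idf} dict per tokenized document."""
    n = len(docs)
    df = Counter(f for doc in docs for f in set(doc))
    return [{f: (c / len(doc)) * math.log(n / df[f])
             for f, c in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def density_cluster(vectors, min_sim=0.95, min_pts=2):
    """DBSCAN-style clustering on cosine similarity (assumed variant).

    Returns {word: cluster_id}; words in no dense region keep their
    own singleton cluster so they still count as features.
    """
    words = sorted(vectors)
    neigh = {w: [x for x in words if cosine(vectors[w], vectors[x]) >= min_sim]
             for w in words}
    label, next_id = {}, 0
    for w in words:
        if w in label or len(neigh[w]) < min_pts:
            continue  # already assigned, or not a core point
        label[w] = next_id
        stack = [w]
        while stack:
            for x in neigh[stack.pop()]:
                if x not in label:
                    label[x] = next_id
                    if len(neigh[x]) >= min_pts:  # only core points spread
                        stack.append(x)
        next_id += 1
    for w in words:  # noise words become singleton clusters
        if w not in label:
            label[w] = next_id
            next_id += 1
    return label

def cluster_features(docs, vectors, min_tfidf=0.0, min_sim=0.95):
    """First TF-IDF pass filters words, second pass scores clusters."""
    first = tf_idf(docs)
    keep = {w for scores in first for w, s in scores.items()
            if s > min_tfidf and w in vectors}
    label = density_cluster({w: vectors[w] for w in keep}, min_sim)
    mapped = [[label[w] for w in doc if w in label] for doc in docs]
    return tf_idf(mapped), label

# Toy corpus and hand-made vectors (assumed data, not from the paper).
docs = [["the", "car", "fast"],
        ["the", "automobile", "quick"],
        ["the", "banana", "sweet"]]
vectors = {"the": (0.5, 0.5), "car": (1.0, 0.1), "automobile": (0.9, 0.2),
           "fast": (0.1, 1.0), "quick": (0.2, 0.9),
           "banana": (-1.0, 0.1), "sweet": (0.1, -1.0)}

features, label = cluster_features(docs, vectors)
print(label)
print(features)
```

In this toy run the first two documents share no word other than "the", so plain TF-IDF gives them no common nonzero feature; after clustering, "car"/"automobile" and "fast"/"quick" map to shared cluster features and the two documents receive identical vectors. Real use would replace the hand-made vectors with trained embeddings and tune `min_sim`, `min_pts`, and `min_tfidf` on the corpus.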
Author affiliation (all authors): Software College, Yunnan University, Kunming, 650091, China
EISBN: 1538683393, 9781538683392
Subject terms: Clustering; Clustering algorithms; Data mining; Feature extraction; Mathematical model; Semantic features; Semantics; Software; Text feature; TF-IDF; Training; Word vector
URI https://ieeexplore.ieee.org/document/8780663