Research on the TF–IDF algorithm combined with semantics for automatic extraction of keywords from network news texts

Bibliographic Details
Published in Journal of intelligent systems Vol. 33; no. 1; pp. 455–65
Main Author Wang, Yan
Format Journal Article
Language English
Published Berlin De Gruyter 18.07.2024
Walter de Gruyter GmbH
Abstract As the number of online news texts continues to grow, automatic keyword extraction has become key to giving users fast access to the content they want. This article first introduces two common algorithms: term frequency–inverse document frequency (TF–IDF) and TextRank. A news title weight is then added to the TF–IDF calculation to reflect the characteristics of network news text, and a new automatic extraction algorithm is designed by applying Word2vec to capture semantics. Experiments on the ACE2005 dataset showed that, as the number of automatically extracted keywords increased, the accuracy of the TF–IDF, TextRank, and semantics-combined TF–IDF algorithms gradually decreased while their recall rates gradually increased. When five keywords were extracted, the gap between the semantics-combined TF–IDF algorithm and the other two algorithms was the largest, and its accuracy, recall rate, and F-measure were 72.77, 78.64, and 75.59%, respectively. Finally, the F-measure of the semantics-combined TF–IDF algorithm reached 81% for network news texts. The experimental results support the performance of the semantics-combined TF–IDF algorithm in automatically extracting keywords from network news texts and suggest promising practical applications.
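As a quick consistency check on the figures reported in the abstract, the sketch below recomputes the F-measure from the stated accuracy and recall for five extracted keywords. It assumes that "accuracy" is used in the sense of precision (the record's subject terms list "precision") and that the F-measure is the balanced harmonic mean F1; the function name `f_measure` is illustrative and not taken from the article.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for five extracted keywords, in percent.
# Assumption: the abstract's "accuracy" is treated here as precision.
precision, recall = 72.77, 78.64
print(round(f_measure(precision, recall), 2))  # prints 75.59
```

Under these assumptions the recomputed value agrees with the 75.59% F-measure stated in the abstract.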
Author affiliation School of Literature, Cangzhou Normal University, Cangzhou, Hebei, 061000, China
Author email wangyan@caztc.edu.cn
Copyright 2024. This work is published under http://creativecommons.org/licenses/by/4.0 (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.1515/jisys-2023-0300
Discipline Computer Science
EISSN 2191-026X
ISSN 2191-026X
0334-1860
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License This work is licensed under the Creative Commons Attribution 4.0 International License.
http://creativecommons.org/licenses/by/4.0
PublicationDate 2024-07-18
SubjectTerms 68W40
Accuracy
Algorithms
automatic keyword extraction
Extraction processes
Information retrieval
Keywords
News
news text
precision
Recall
Semantics
term frequency–inverse document frequency
Texts