Research on the TF–IDF algorithm combined with semantics for automatic extraction of keywords from network news texts

Bibliographic Details
Published in	Journal of intelligent systems Vol. 33; no. 1; pp. 455 - 65
Main Author	Wang, Yan
Format	Journal Article
Language	English
Published	Berlin: De Gruyter (Walter de Gruyter GmbH), 18.07.2024
Subjects	68W40; Accuracy; Algorithms; automatic keyword extraction; Extraction processes; Information retrieval; Keywords; News; news text; precision; Recall; Semantics; term frequency–inverse document frequency; Texts
Abstract	As the volume of online news text continues to grow, automatic keyword extraction has become key to helping users quickly reach the content they want. This article first introduces two common algorithms: term frequency–inverse document frequency (TF–IDF) and TextRank. A news title weight is then added to the TF–IDF calculation to reflect the characteristics of network news text, and a new automatic extraction algorithm is designed by applying Word2vec to capture semantics. Experiments on the ACE2005 dataset showed that, as the number of automatically extracted keywords increased, the accuracy of the TF–IDF, TextRank, and semantics-combined TF–IDF algorithms gradually decreased while their recall rates gradually increased. When five keywords were extracted, the gap between the semantics-combined TF–IDF algorithm and the other two algorithms was largest, and its accuracy, recall rate, and F-measure were 72.77, 78.64, and 75.59%, respectively. Finally, the F-measure of the semantics-combined TF–IDF algorithm reached 81% on network news texts. These results demonstrate the performance of the semantics-combined TF–IDF algorithm in automatically extracting keywords from network news texts and suggest promising practical applications.
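The abstract describes adding a title weight to TF–IDF and evaluating the extracted keywords by accuracy, recall rate, and F-measure. The following is a minimal sketch of that idea, not the article's implementation: the multiplicative `title_boost`, the smoothed IDF, and all function names are illustrative assumptions, and the Word2vec semantic step is omitted entirely.

```python
import math
from collections import Counter


def title_weighted_tfidf(doc_tokens, title_tokens, corpus, title_boost=1.5):
    """Score the words of one document with TF-IDF, boosting title words.

    ASSUMPTION: the multiplicative `title_boost` and the smoothed IDF are
    illustrative choices; the article's exact weighting is not given here.
    `corpus` is a list of tokenized documents (lists of strings).
    """
    term_counts = Counter(doc_tokens)
    n_docs = len(corpus)
    title_words = set(title_tokens)
    scores = {}
    for word, count in term_counts.items():
        # Document frequency: number of documents containing the word.
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (avoids division by zero).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        score = (count / len(doc_tokens)) * idf
        if word in title_words:
            score *= title_boost  # hypothetical title weighting
        scores[word] = score
    return scores


def top_k(scores, k):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f(extracted, gold):
    """Compare extracted keywords against a gold-standard keyword set."""
    hits = len(set(extracted) & set(gold))
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)
```

Assuming the reported F-measure is the balanced harmonic mean of accuracy and recall, `f_measure(72.77, 78.64)` evaluates to about 75.59, which agrees with the figure reported for five extracted keywords.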
Author_xml	Wang, Yan (wangyan@caztc.edu.cn); School of Literature, Cangzhou Normal University, Cangzhou, Hebei, 061000, China
Copyright	2024. This work is published under http://creativecommons.org/licenses/by/4.0 (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI	10.1515/jisys-2023-0300
DatabaseName	CrossRef ProQuest Computer Science Collection DOAJ Directory of Open Access Journals
Discipline	Computer Science
EISSN	2191-026X
ISSN	2191-026X 0334-1860
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
License	This work is licensed under the Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0
OpenAccessLink	http://journals.scholarsportal.info/openUrl.xqy?doi=10.1515/jisys-2023-0300
PageCount	10
URI	https://www.degruyter.com/doi/10.1515/jisys-2023-0300 https://www.proquest.com/docview/3082283231 https://doaj.org/article/06a41b2391364a72a86095aa394d1a3f