A Study of a Cross-Lingual Text Similarity Calculation Method Based on Bilingual LDA
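Drawing on the bilingual topic model approach to analyzing the similarity of bilingual texts, this paper proposes a cross-lingual text similarity calculation method based on bilingual LDA. A bilingual LDA model is first trained on a bilingual parallel corpus and then used to predict the topic distributions of a new corpus, so that the bilingual documents of the new corpus are mapped into the same topic vector space. The similarity of these bilingual documents is then computed from their topic distributions with cosine similarity, with feature topic weights computed by a topic frequency-inverse document frequency method improved from the perspective of the between-category and within-category dispersion of the topic distribution. Experiments show that the improved weight calculation considerably raises the recall of the bilingual-LDA-based similarity algorithm, and that the algorithm is not restricted to particular categories and is reasonably reliable.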
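The similarity step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that each document has already been mapped to a topic distribution by some trained bilingual LDA model, and the function name `weighted_cosine`, the example vectors, and the per-topic `weights` argument are all hypothetical. In particular, the record does not give the improved topic frequency-inverse document frequency formula, so the weights here are only a placeholder that defaults to uniform.

```python
import math


def weighted_cosine(theta_a, theta_b, weights=None):
    """Cosine similarity of two topic distributions in a shared topic space.

    `weights` is a hypothetical per-topic weight vector standing in for the
    paper's improved topic weighting, whose formula is not given in this record.
    """
    if len(theta_a) != len(theta_b):
        raise ValueError("topic vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(theta_a)  # uniform weights: plain cosine similarity
    a = [w * x for w, x in zip(weights, theta_a)]
    b = [w * x for w, x in zip(weights, theta_b)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # define similarity as 0 for a zero vector


# Made-up topic distributions over three shared topics (illustrative only).
theta_zh = [0.70, 0.20, 0.10]     # e.g. a Chinese document
theta_en = [0.65, 0.25, 0.10]     # e.g. an English document on a similar topic
theta_other = [0.05, 0.15, 0.80]  # e.g. an English document on another topic

sim_close = weighted_cosine(theta_zh, theta_en)
sim_far = weighted_cosine(theta_zh, theta_other)
print(sim_close > sim_far)
```

Because both documents live in the same topic vector space, the comparison needs no translation step; only the topic distributions (and, optionally, the topic weights) enter the computation.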
Published in | Computer Engineering & Science (计算机工程与科学) Vol. 39; no. 5; pp. 978-983 |
---|---|
Main Author | |
Format | Journal Article |
Language | Chinese |
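Published | School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, Yunnan; 2017 |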
ISSN | 1007-130X |
DOI | 10.3969/j.issn.1007-130X.2017.05.024 |
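Abstract | Drawing on the bilingual topic model approach to analyzing the similarity of bilingual texts, this paper proposes a cross-lingual text similarity calculation method based on bilingual LDA. A bilingual LDA model is first trained on a bilingual parallel corpus and then used to predict the topic distributions of a new corpus, so that the bilingual documents of the new corpus are mapped into the same topic vector space. The similarity of these bilingual documents is then computed from their topic distributions with cosine similarity, with feature topic weights computed by a topic frequency-inverse document frequency method improved from the perspective of the between-category and within-category dispersion of the topic distribution. Experiments show that the improved weight calculation considerably raises the recall of the bilingual-LDA-based similarity algorithm, and that the algorithm is not restricted to particular categories and is reasonably reliable. |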
---|---|
Abstract_FL | Based on the idea of the bilingual topic model, we analyze the similarity of bilingual documents and propose a cross-lingual document similarity calculation method based on bilingual LDA. First, we train a bilingual LDA model on bilingual parallel documents and then use the trained model to predict the topic distribution of a new corpus, so that the bilingual documents of the new corpus are mapped into the same topic vector space. We then combine the topic distributions with the cosine similarity method to calculate the similarity of the bilingual documents in the new corpus. We improve the topic frequency-inverse document frequency method from the perspective of the within-category and between-category dispersion of the topic distribution, and use the improved method to calculate feature topic weights. Experimental results show that the improved weight calculation method considerably raises the recall of the bilingual LDA similarity algorithm, and that the algorithm is not limited to particular categories and is reliable. |
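Author | CHENG Wei (程蔚); XIAN Yan-tuan (线岩团); ZHOU Lan-jiang (周兰江); YU Zheng-tao (余正涛); WANG Hong-bin (王红斌) |
AuthorAffiliation | School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, Yunnan |
AuthorAffiliation_xml | – name: School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, Yunnan |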
Author_FL | ZHOU Lan-jiang CHENG Wei XIAN Yan-tuan YU Zheng-tao WANG Hong-bin |
Author_FL_xml | – sequence: 1 fullname: CHENG Wei – sequence: 2 fullname: XIAN Yan-tuan – sequence: 3 fullname: ZHOU Lan-jiang – sequence: 4 fullname: YU Zheng-tao – sequence: 5 fullname: WANG Hong-bin |
Author_xml | – sequence: 1 fullname: CHENG Wei (程蔚); XIAN Yan-tuan (线岩团); ZHOU Lan-jiang (周兰江); YU Zheng-tao (余正涛); WANG Hong-bin (王红斌) |
ClassificationCodes | TP391 |
ContentType | Journal Article |
Copyright | Copyright © Wanfang Data Co. Ltd. All Rights Reserved. |
Copyright_xml | – notice: Copyright © Wanfang Data Co. Ltd. All Rights Reserved. |
DOI | 10.3969/j.issn.1007-130X.2017.05.024 |
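DatabaseName | VIP Journals (维普_期刊); Chinese Science and Technology Journal Database, CALIS site; VIP Chinese Journal Database; Chinese Science and Technology Journal Database, Engineering and Technology; Chinese Science and Technology Journal Database, mirror site; Wanfang Data Journals - Hong Kong; WANFANG Data Centre; Wanfang Data Journals; Wanfang Data Journals, Hong Kong edition (万方数据期刊 - 香港版); China Online Journals (COJ) |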
DocumentTitleAlternate | A cross-lingual document similarity calculation method based on bilingual LDA |
DocumentTitle_FL | A cross-lingual document similarity calculation method based on bilingual LDA |
EndPage | 983 |
ExternalDocumentID | jsjgcykx201705024 672136012 |
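GrantInformation_xml | – fundername: National Natural Science Foundation of China; General Program of the Yunnan Provincial Department of Science and Technology; Scientific Research Fund of the Yunnan Provincial Department of Education; Provincial Talent Training Project of Kunming University of Science and Technology funderid: (61363044,61462054); (2015FB135); (2014Z021); (KKSY201403028) |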
ISSN | 1007-130X |
IngestDate | Thu May 29 04:04:00 EDT 2025 Wed Feb 14 10:02:52 EST 2024 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 5 |
Keywords | bilingual LDA; cross-lingual document similarity calculation; cosine similarity; topic frequency-inverse document frequency |
Language | Chinese |
LinkModel | OpenURL |
Notes | 43-1258/TP bilingual LDA; cross-lingual document similarity calculation; cosine similarity; topic frequency-inverse document frequency Based on the idea of bilingual topic model, we analyze similarity of bilingual documents and propose a cross-lingual document similarity calculation method based on bilingual LDA. Firstly we use the bilingual parallel documents to train the bilingual LDA model and then use the trained model to predict the topic distribution of the new corpus. The new corpus's bilingual documents are mapped to the vector space of the same topic. We use the cosine similarity method and topic distribution combined to calculate the similarity of the bilingual documents of the new corpus. We improve the topic frequency inverse document frequency method from the aspect of the dispersion of in-category and the between-category topic distribution, and utilize the improved method to calculate feature topic weights. Experimental results show that the improved weight calculation method can enhance the |
PageCount | 6 |
ParticipantIDs | wanfang_journals_jsjgcykx201705024 chongqing_primary_672136012 |
PublicationCentury | 2000 |
PublicationDate | 2017 |
PublicationDateYYYYMMDD | 2017-01-01 |
PublicationDate_xml | – year: 2017 text: 2017 |
PublicationDecade | 2010 |
PublicationTitle | Computer Engineering & Science (计算机工程与科学) |
PublicationTitleAlternate | Computer Engineering & Science |
PublicationTitle_FL | Computer Engineering and Science |
PublicationYear | 2017 |
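Publisher | School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan; Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, Yunnan |
Publisher_xml | – name: School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan – name: Key Laboratory of Intelligent Information Processing, Kunming University of Science and Technology, Kunming 650500, Yunnan |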
SourceID | wanfang chongqing |
SourceType | Aggregation Database Publisher |
StartPage | 978 |
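SubjectTerms | topic frequency-inverse document frequency; cosine similarity; bilingual LDA; cross-lingual text similarity |
Title | A Study of a Cross-Lingual Text Similarity Calculation Method Based on Bilingual LDA |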
URI | http://lib.cqvip.com/qk/94293X/201705/672136012.html https://d.wanfangdata.com.cn/periodical/jsjgcykx201705024 |
Volume | 39 |