An Entropy-Based Method for Computing Text Similarity (一种基于熵的文本相似性计算方法)

Bibliographic Details
Published in	Application Research of Computers (计算机应用研究), Vol. 33, no. 3, pp. 665-668
Main Author	Li Shengwen (李圣文); Ling Wei (凌微); Gong Junfang (龚君芳); Zhou Changzheng (周长征)
Format	Journal Article
Language	Chinese
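Published	School of Information Engineering, China University of Geosciences, Wuhan 430074, China; State Grid Shiyan Electric Power Company, Shiyan, Hubei 442000, China; 2016
Subjects	text similarity (文本相似性); string matching (字符串匹配); longest common subsequence (最长公共子序列); edit distance algorithm / Levenshtein distance algorithm (编辑距离算法)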
ISSN	1001-3695
DOI	10.3969/j.issn.1001-3695.2016.03.006

More Information
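Summary:	TP391.1; Text comparison is the process of computing the similarity between two texts; the higher the similarity, the more alike the two texts are. Traditional similarity algorithms measure text similarity mainly at the character level and ignore the effect that multiple common substrings within the texts have on their similarity. To address this problem, the paper proposes an entropy-based similarity method: building on the character information extracted from the two texts, it establishes a measurement dimension over their common substrings and then measures similarity using entropy. Experiments show that the method produces a smoother similarity curve, which the authors take as evidence of the algorithm's effectiveness and accuracy.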
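The summary does not give the paper's formulas, so the following is only a minimal sketch of how an entropy term over common substrings could enter a similarity score. It is not the authors' algorithm. The names `common_substrings` and `entropy_similarity`, the greedy longest-first extraction, the `min_len` threshold, and the `alpha` penalty weight are all assumptions made for illustration.

```python
import math


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b (DP, O(len(a)*len(b)))."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


def common_substrings(a: str, b: str, min_len: int = 2) -> list[str]:
    """Greedily extract non-overlapping common substrings, longest first.

    Assumes the inputs do not contain the sentinel characters \\x00 and \\x01.
    """
    found = []
    while True:
        best = longest_common_substring(a, b)
        if len(best) < min_len:
            break
        found.append(best)
        # Mask the matched span with different sentinels in each text so that
        # neither the span nor the sentinels can match again.
        a = a.replace(best, "\x00", 1)
        b = b.replace(best, "\x01", 1)
    return found


def entropy_similarity(a: str, b: str, min_len: int = 2, alpha: float = 1.0) -> float:
    """Hypothetical score: coverage of common substrings, reduced by the
    normalised Shannon entropy of how the matched characters are split
    across those substrings (more fragments means higher entropy)."""
    lengths = [len(s) for s in common_substrings(a, b, min_len)]
    matched = sum(lengths)
    if matched == 0:
        return 0.0
    coverage = 2 * matched / (len(a) + len(b))
    entropy = -sum((n / matched) * math.log(n / matched) for n in lengths)
    normalised = entropy / math.log(matched) if matched > 1 else 0.0
    return coverage * (1 - alpha * normalised)
```

Under these assumptions, two texts that share one long run score higher than two texts that share the same number of characters split across several shorter runs, which is one way to account for multiple common substrings rather than individual characters alone.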
Bibliography:	51-1196/TP. Li Shengwen, Ling Wei, Gong Junfang, Zhou Changzheng (1. School of Information Engineering, China University of Geosciences, Wuhan 430074, China; 2. State Grid Shiyan Electric Power Company, Shiyan, Hubei 442000, China). Abstract: Text comparison is the process of finding the similarity between two texts; the higher the similarity, the more alike the two texts are. Traditional methods measure similarity from the perspective of the characters in the text and ignore the contribution of multiple common substrings within the texts. To address this problem, this paper proposes a text similarity method based on entropy. The method extracts common strings from the texts, establishes a measurement dimension over the common substrings, and calculates the similarity based on entropy. Experiments show that the method has a smoother similarity curve, so the algorithm is effective and accurate. Key words: text similarity; string matching; Levenshtein distance algorithm; longest common subsequence
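For reference, the two character-level baselines named in the keywords, Levenshtein (edit) distance and the longest common subsequence, can be sketched as below. These are the standard dynamic-programming formulations. The normalisations in `edit_similarity` and `lcs_similarity` are common conventions assumed here for illustration, not necessarily the ones used in the paper's experiments.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (characters need not be contiguous)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer text."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 2 * LCS length / total length of both texts."""
    total = len(a) + len(b)
    return 1.0 if total == 0 else 2 * lcs_length(a, b) / total
```

Both measures work on individual characters, which is the limitation the abstract attributes to traditional methods: neither distinguishes one long shared passage from many scattered shared characters.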