An Entropy-Based Method for Computing Text Similarity

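TP391.1; Text comparison is the process of computing the similarity between two texts; the higher the similarity, the more alike the two texts are. Traditional similarity algorithms measure text similarity mainly at the character level and ignore the effect that multiple common substrings within the texts have on similarity. To address this problem, an entropy-based similarity method is proposed: building on the character information extracted from the two texts, it establishes a measurement dimension over their common substrings and then uses entropy to measure similarity. Experiments show that the method yields a smoother similarity curve, which supports the effectiveness and accuracy of the algorithm.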

Bibliographic Details
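Published in 计算机应用研究 (Application Research of Computers) Vol. 33; no. 3; pp. 665-668
Main Author Li Shengwen (李圣文), Ling Wei (凌微), Gong Junfang (龚君芳), Zhou Changzheng (周长征)
Format Journal Article
Language Chinese
Published School of Information Engineering, China University of Geosciences, Wuhan 430074; State Grid Shiyan Electric Power Company, Shiyan, Hubei 442000 2016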
ISSN 1001-3695
DOI 10.3969/j.issn.1001-3695.2016.03.006

Bibliography: 51-1196/TP
Li Shengwen, Ling Wei, Gong Junfang, Zhou Changzheng (1. School of Information Engineering, China University of Geosciences, Wuhan 430074, China; 2. State Grid Shiyan Electric Power Company, Shiyan Hubei 442000, China)
Text comparison is the process of finding the similarity between two texts; the higher the similarity, the more alike the two texts are. Traditional methods measure similarity at the character level and ignore the contribution of multiple common substrings within the texts. To address this problem, this paper proposes an entropy-based text similarity method. The method extracts common strings from the texts, establishes a measurement dimension over the common substrings, and then calculates similarity based on entropy. Experiments show that the method produces a smoother similarity curve, which supports the effectiveness and accuracy of the algorithm.
text similarity; string match; Levenshtein distance algorithm; longest common sequence
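
The abstract contrasts character-level measures (the keywords name Levenshtein distance) with a measure built on common substrings and entropy, but this record does not give the paper's formula. The Python sketch below is therefore only an illustration under stated assumptions, not the authors' algorithm: `common_substrings` uses `difflib.SequenceMatcher` as a stand-in extraction step, and `fragmentation_entropy` and `entropy_weighted_similarity` are hypothetical names and a hypothetical scoring rule chosen for this example.

```python
from __future__ import annotations

import math
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Character-level baseline: classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def common_substrings(a: str, b: str) -> list[str]:
    """Non-overlapping common blocks found by difflib (assumed stand-in,
    not the extraction step described in the paper)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks() if m.size > 0]


def fragmentation_entropy(blocks: list[str]) -> float:
    """Shannon entropy (bits) of the share of matched characters per block."""
    total = sum(len(s) for s in blocks)
    if total == 0:
        return 0.0
    return -sum((len(s) / total) * math.log2(len(s) / total) for s in blocks)


def entropy_weighted_similarity(a: str, b: str) -> float:
    """Hypothetical score: matched-character coverage, discounted when the
    match is scattered over many small blocks (high entropy)."""
    if not a and not b:
        return 1.0
    blocks = common_substrings(a, b)
    total = sum(len(s) for s in blocks)
    if total == 0:
        return 0.0
    coverage = 2 * total / (len(a) + len(b))
    # Entropy is at most log2(total), reached when every block has length 1.
    norm = fragmentation_entropy(blocks) / math.log2(total) if total > 1 else 0.0
    return coverage * (1.0 - norm)
```

With this rule, two texts that share one long block score higher than two texts that share the same number of characters spread over many short blocks, which is one way to let common substrings, rather than single characters, drive the score.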