Hadoop big data similarity duplicate removal method and device

Bibliographic Details
Main Authors	XU JILAI, LUO XIAOFENG, DU TENGFEI, ZHANG YANTANG, LI RUICHEN
Format	Patent
Language	Chinese, English
Published	01.03.2024
Subjects	CALCULATING; COMPUTING; COUNTING, ELECTRIC DIGITAL DATA PROCESSING, PHYSICS

More Information
Summary:	The embodiment of the invention discloses a Hadoop big data similarity deduplication method and device. The method comprises: obtaining a data deduplication instruction that carries name information; determining the field information and threshold information corresponding to the name information; performing similarity comparison on the multiple pieces of field content corresponding to the field information to obtain similarity values; and performing a deduplication operation on the multiple pieces of field content based on the similarity values and the threshold information. With this scheme, similar data in the Hadoop distributed file system can be deduplicated, which saves Hadoop storage space, improves the operating performance of the components of the Hadoop distributed file system, and reduces daily operation and maintenance costs.
Bibliography:	Application Number: CN202311653720
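
The four steps in the summary can be illustrated with a minimal, single-process sketch. It is not the patented implementation: the record does not say which similarity measure is used, how the name information maps to a field and threshold, or how the work is distributed across Hadoop. Here `DEDUP_CONFIG`, `similarity`, and `deduplicate` are hypothetical names, Python's `difflib.SequenceMatcher` stands in for the unspecified similarity measure, and treating a value at or above the threshold as a duplicate is an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from the name carried by the deduplication
# instruction to the field to compare and the similarity threshold.
DEDUP_CONFIG = {
    "customer_table": {"field": "address", "threshold": 0.9},
}


def similarity(a: str, b: str) -> float:
    """Return a similarity value in [0, 1].

    SequenceMatcher is an assumed stand-in; the source does not name
    the similarity measure.
    """
    return SequenceMatcher(None, a, b).ratio()


def deduplicate(name: str, records: list[dict]) -> list[dict]:
    """Keep the first record of every group of similar field contents."""
    # Steps 1-2: resolve the field and threshold from the name information.
    config = DEDUP_CONFIG[name]
    field, threshold = config["field"], config["threshold"]

    kept: list[dict] = []
    for record in records:
        content = record[field]
        # Steps 3-4: compare against every kept record and drop this one
        # if any similarity value reaches the threshold (assumed rule).
        if all(similarity(content, k[field]) < threshold for k in kept):
            kept.append(record)
    return kept


if __name__ == "__main__":
    rows = [
        {"id": 1, "address": "12 Main Street, Springfield"},
        {"id": 2, "address": "12 Main Street, Springfeld"},
        {"id": 3, "address": "99 Ocean Avenue, Shelbyville"},
    ]
    for row in deduplicate("customer_table", rows):
        print(row)
```

The pairwise comparison is quadratic in the number of records, so this sketch only suits small in-memory samples; a job over data stored in the Hadoop distributed file system would need to partition or block the records first, which the record does not describe.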