Hadoop big data similarity duplicate removal method and device
Main Authors | , , , , |
---|---|
Format | Patent |
Language | Chinese, English |
Published | 01.03.2024 |
Subjects | |
Summary: | The embodiment of the invention discloses a method and device for similarity-based deduplication of Hadoop big data. The method comprises: obtaining a data deduplication instruction that carries name information; determining the field information and threshold information corresponding to the name information; comparing the multiple pieces of field content corresponding to the field information for similarity to obtain similarity values; and performing a deduplication operation on the field contents based on the similarity values and the threshold information. The scheme deduplicates similar data stored in the Hadoop Distributed File System, which saves Hadoop storage space, improves the operating performance of the file system's components, and reduces day-to-day operation and maintenance costs. |
Bibliography: | Application Number: CN202311653720 |
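
The four steps in the summary can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the `RULES` registry, the `DedupRule` type, the record layout, and the use of Python's `difflib.SequenceMatcher` as the similarity metric are all assumptions made for illustration, as is the choice to treat a similarity at or above the threshold as a duplicate. The record does not say which metric is used or how the comparison is distributed across Hadoop.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class DedupRule:
    field: str        # field whose contents are compared
    threshold: float  # similarity at or above this counts as a duplicate (assumed)


# Hypothetical registry: maps the name carried by the deduplication
# instruction to its field information and threshold information.
RULES = {"customer_table": DedupRule(field="address", threshold=0.9)}


def similarity(a: str, b: str) -> float:
    """Return a similarity value in [0, 1]; SequenceMatcher is a stand-in metric."""
    return SequenceMatcher(None, a, b).ratio()


def deduplicate(name: str, records: list[dict]) -> list[dict]:
    """Keep the first record of every group of near-identical field contents."""
    rule = RULES[name]  # step 2: look up field and threshold by name
    kept: list[dict] = []
    for record in records:
        value = record[rule.field]
        # Steps 3 and 4: compare against every kept value and drop the
        # record if any similarity reaches the threshold.
        if all(similarity(value, k[rule.field]) < rule.threshold for k in kept):
            kept.append(record)
    return kept


records = [
    {"id": 1, "address": "12 Main Street, Springfield"},
    {"id": 2, "address": "12 Main Street, Springfeld"},  # near-duplicate of id 1
    {"id": 3, "address": "99 Ocean Avenue, Portland"},
]
unique = deduplicate("customer_table", records)
```

The pairwise comparison here is quadratic in the number of records, so a sketch like this only suits small inputs; at Hadoop scale the comparison would have to be partitioned or approximated, which the summary does not describe.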