Deduplication by phrase substitution within chunks of substantially similar content
Main Authors | , , , |
---|---|
Format | Patent |
Language | English |
Published | 28.01.2020 |
Subjects | |
Summary: | A method, system and computer program product for phrase substitution within chunks of substantially similar content. The method includes: retrieving from content files a first and a second content chunk whose similarity exceeds a predetermined threshold; identifying a candidate for substitution, wherein the candidate for substitution is a string of characters in the second content chunk that is not identical to a corresponding string of characters in the first content chunk; comparing the candidate for substitution with a synonym database to find a match, wherein the synonym database provides a plurality of synonym suggestions to convert the candidate for substitution in the first content chunk and the second content chunk to an identical string of characters; replacing the candidate for substitution with a reference to the identical string of characters; and storing a single copy of the identical string of characters in a common repository. |
---|---|
Bibliography: | Application Number: US201514817296 |
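
The steps in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation: the word-level comparison via `difflib.SequenceMatcher`, the 0.8 threshold, the in-memory synonym dictionary, the `@n` reference format, and the names `deduplicate` and `_intern` are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def _intern(repo, phrase):
    """Return the reference for phrase, storing a single copy in repo."""
    for ref, stored in repo.items():
        if stored == phrase:
            return ref
    ref = f"@{len(repo)}"  # hypothetical reference format
    repo[ref] = phrase
    return ref


def deduplicate(first, second, synonyms, threshold=0.8):
    """Substitute synonym-equivalent phrases in two similar chunks.

    Returns (new_first, new_second, repo), or None when the chunks are
    not similar enough. `synonyms` maps a phrase to a canonical form
    (a stand-in for the synonym database in the summary).
    """
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(None, a, b)
    if matcher.ratio() < threshold:  # chunks must exceed the threshold
        return None

    repo = {}  # common repository: reference -> single stored copy
    out_a, out_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        phrase_a = " ".join(a[i1:i2])
        phrase_b = " ".join(b[j1:j2])
        if tag == "replace":
            # Candidate for substitution: the strings differ at this spot.
            canon_a = synonyms.get(phrase_a, phrase_a)
            canon_b = synonyms.get(phrase_b, phrase_b)
            if canon_a == canon_b:
                ref = _intern(repo, canon_a)
                out_a.append(ref)
                out_b.append(ref)
                continue
        # Equal spans and unmatched differences are kept unchanged.
        out_a.extend(a[i1:i2])
        out_b.extend(b[j1:j2])
    return " ".join(out_a), " ".join(out_b), repo


if __name__ == "__main__":
    table = {"automobile": "car", "vehicle": "car"}  # toy synonym data
    print(deduplicate(
        "the quick car drove down the long road today",
        "the quick automobile drove down the long road today",
        table,
    ))
```

After substitution the two chunks become character-for-character identical, so an ordinary exact-match deduplicator could then store them once; the repository holds the only copy of the shared phrase.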