Deduplication by phrase substitution within chunks of substantially similar content


Bibliographic Details
Main Authors: Samuel, Abigail; Allen, Jr., Lloyd W; Acharya, Alka A; Jenkins, Jana H
Format: Patent
Language: English
Published: 28.01.2020
Summary: A method, system and computer program product for phrase substitution within chunks of substantially similar content. The method includes: retrieving from content files a first and a second content chunk which are identical above a predetermined threshold; identifying a candidate for substitution, wherein the candidate for substitution is a string of characters in the second content chunk that is not identical to a corresponding string of characters in the first content chunk; comparing the candidate for substitution with a synonym database to find a match, wherein the synonym database provides a plurality of synonym suggestions to convert the candidate for substitution in the first content chunk and the second content chunk to an identical string of characters; replacing the candidate for substitution with a reference to the identical string of characters; and storing a single copy of the identical string of characters in a common repository.
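
The steps in the summary can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the patented implementation: the word-level comparison via `difflib.SequenceMatcher`, the `SYNONYMS` table, the `THRESHOLD` value, the `@n` reference tokens, and the function names `intern_phrase` and `substitute` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical synonym database: every variant maps to one canonical phrase.
SYNONYMS = {"begin": "start", "commence": "start", "start": "start"}
THRESHOLD = 0.8  # assumed value for the "predetermined threshold"


def intern_phrase(phrase, repository):
    """Store a phrase once in the common repository and return its reference."""
    for key, value in repository.items():
        if value == phrase:
            return key  # already stored: reuse the single copy
    key = f"@{len(repository)}"
    repository[key] = phrase
    return key


def substitute(first, second, repository):
    """Replace synonymous differing words in two similar chunks with references.

    Chunks are compared word by word; whitespace is normalized by split/join.
    """
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    if matcher.ratio() < THRESHOLD:
        return first, second  # not similar enough: leave both chunks untouched

    out_a, out_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A one-word "replace" is treated as a candidate for substitution.
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            canon_a = SYNONYMS.get(a[i1].lower())
            canon_b = SYNONYMS.get(b[j1].lower())
            if canon_a is not None and canon_a == canon_b:
                ref = intern_phrase(canon_a, repository)
                out_a.append(ref)
                out_b.append(ref)
                continue
        # Everything else (equal runs, unmatched differences) is kept as-is.
        out_a.extend(a[i1:i2])
        out_b.extend(b[j1:j2])
    return " ".join(out_a), " ".join(out_b)


repo = {}
chunk1 = "Press the button to begin the installation process now"
chunk2 = "Press the button to commence the installation process now"
new1, new2 = substitute(chunk1, chunk2, repo)
```

In this sketch both chunks end up as the same text containing a reference token, and the repository holds the shared phrase once, so a conventional exact-match deduplication pass could then collapse the two chunks.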
Bibliography: Application Number: US201514817296