SYSTEM AND METHOD FOR DETECTING DUPLICATE CONTENT ITEMS

Bibliographic Details
Main Authors BHATTACHARJEE ARNABNIL, AHUJA RAJAT, SCHONFELD URI
Format Patent
Language English
Published 14.05.2009
Summary: Generally, the present invention provides systems, methods and computer program products for detecting different content items with similar content by examining the anchor text of links. A method of the present invention comprises selecting one of a plurality of websites, crawling the selected website to identify one or more content items, and downloading those content items. The linking relationships among the downloaded content items are then determined, and one or more linking rules are learned by association rule mining over the content items. The linking rules are then applied to content items of one or more websites to determine storage of those content items on a search provider's central server.
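
The learn-then-apply flow in the summary can be illustrated with a minimal sketch. The record does not say what form a linking rule takes, so this example assumes one possible form: "query parameter P does not change the content". Each group of URLs that differ only in P counts toward the rule's support, and the rule's confidence is the fraction of those groups whose links all carry the same anchor text. The function names, thresholds, and sample URLs are hypothetical and are not taken from the patent.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def query_pairs(url):
    """Return the (key, value) pairs of the URL's query string."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


def drop_params(url, params):
    """Return the URL with every query parameter named in `params` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in query_pairs(url) if k not in params]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def learn_rules(links, min_support=2, min_confidence=0.8):
    """Learn ignorable query parameters from (target_url, anchor_text) pairs.

    Assumed rule form (not from the patent): "parameter P is irrelevant".
    Returns {param: (support, confidence)} for rules that pass both thresholds.
    """
    # Collect the normalized anchor texts seen for each target URL.
    anchors = defaultdict(set)
    for url, text in links:
        anchors[url].add(text.strip().lower())

    all_params = {k for url in anchors for k, _ in query_pairs(url)}
    rules = {}
    for param in sorted(all_params):
        # Group URLs that become identical once `param` is removed.
        groups = defaultdict(list)
        for url in anchors:
            if any(k == param for k, _ in query_pairs(url)):
                groups[drop_params(url, {param})].append(url)

        # Only groups with at least two distinct URLs say anything about `param`.
        candidates = [g for g in groups.values() if len(g) > 1]
        support = len(candidates)
        if support < min_support:
            continue

        # A group agrees when all of its URLs share exactly one anchor text.
        agreeing = sum(
            1 for g in candidates
            if len(set.union(*(anchors[u] for u in g))) == 1
        )
        confidence = agreeing / support
        if confidence >= min_confidence:
            rules[param] = (support, confidence)
    return rules


def canonical_url(url, rules):
    """Apply learned rules: strip parameters judged irrelevant to content."""
    return drop_params(url, set(rules))


# Hypothetical links observed while crawling one site.
example_links = [
    ("http://example.com/item?id=1&sid=aaa", "Red Widget"),
    ("http://example.com/item?id=1&sid=bbb", "Red Widget"),
    ("http://example.com/item?id=2&sid=aaa", "Blue Widget"),
    ("http://example.com/item?id=2&sid=ccc", "blue widget"),
]
example_rules = learn_rules(example_links)
```

Under these assumptions, two URLs that map to the same `canonical_url` would be treated as the same content item, so only one copy would need to be stored. A production system would presumably mine richer rule forms and compare page content as well as anchor text.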
Bibliography: Application Number: US20070939834