SYSTEM AND METHOD FOR DETECTING DUPLICATE CONTENT ITEMS
Main Authors | , , |
---|---|
Format | Patent |
Language | English |
Published | 14.05.2009 |
Subjects | |
Summary: | Generally, the present invention provides systems, methods and computer program products for detecting different content items with similar content by examining the anchor text of links. A method of the present invention comprises selecting one of a plurality of websites, crawling the selected website to identify one or more content items, and downloading one or more content items of the selected website. One or more linking relationships are then determined from the one or more content items of the selected website, and one or more linking rules are learned based upon association rule mining of the one or more content items. The one or more linking rules are then applied to one or more content items of one or more websites in order to determine storage of the one or more content items on a search provider's central server. |
---|---|
Bibliography: | Application Number: US20070939834 |
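
The summary above describes learning linking rules by association rule mining and then applying them to decide which content items to store. The record gives no implementation details, so the following is only a minimal sketch under stated assumptions: rules take the form "anchor-text token implies duplicate", the `is_duplicate` label is assumed to come from comparing downloaded content items, and the names `Link`, `learn_rules`, `should_store` and both thresholds are hypothetical, not taken from the patent.

```python
from collections import Counter
from typing import NamedTuple


class Link(NamedTuple):
    anchor_text: str    # visible text of the hyperlink
    is_duplicate: bool  # assumed label: target duplicates another content item


def learn_rules(links, min_support=2, min_confidence=0.8):
    """Mine single-token rules of the form: token in anchor text -> duplicate."""
    seen = Counter()  # links whose anchor text contains the token
    dup = Counter()   # of those, links whose target is a duplicate
    for link in links:
        for token in set(link.anchor_text.lower().split()):
            seen[token] += 1
            if link.is_duplicate:
                dup[token] += 1
    # Keep tokens that meet both the support and the confidence threshold.
    return {
        token
        for token, count in seen.items()
        if dup[token] >= min_support and dup[token] / count >= min_confidence
    }


def should_store(anchor_text, rules):
    """Store the link target unless a learned rule flags it as a duplicate."""
    return not (set(anchor_text.lower().split()) & rules)


# Illustrative training data (invented for this sketch).
training = [
    Link("printer friendly version", True),
    Link("printer friendly", True),
    Link("email this article", False),
    Link("read full article", False),
    Link("print version", True),
]
rules = learn_rules(training)
```

With this toy data, a link labelled "Printer friendly page" would be skipped, while a link labelled "read more" would still be stored. Real association rule mining would typically consider multi-item rules and richer features than single tokens; the single-token form is chosen here only to keep the sketch short.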