Data Extraction Approach For Retail Crawling Engine

Bibliographic Details
Main Authors: Zaytsev, Andrey; Gilfanov, Ruslan; Aggarwal, Amit
Format: Patent
Language: English
Published: 30.03.2023
Summary: A computer system extracts product data from a website and correlates product records from multiple sources with one another as corresponding to the same product. A website is crawled efficiently by rendering webpages using a virtual browser that ignores blacklisted elements, extracts data from objects without rendering them, and suppresses retrieval of remote resources. Data is extracted according to engine control statements, each including a selector and an extractor. A website may be crawled repeatedly, and changes in the extracted data may be detected and flagged. Engine control statements may be changed automatically in response to detecting a change in the configuration of the website. Images of product records may be correlated with one another by first comparing the text of the product records and then selecting images for comparison based on composition. Images are compared using a machine learning model. Images determined to be similar may be presented to a human for a correlation decision.
Bibliography: Application Number: US202117486562
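
Illustrative sketch (not taken from the patent text): the summary describes extraction driven by engine control statements made of a selector and an extractor, plus detection of changes between repeated crawls. The Python below is one minimal way such a pairing might look, using only the standard library. Every name in it (ControlStatement, apply_statements, detect_changes, the sample pages, and the tag-plus-class selector form) is a hypothetical chosen for this example; the actual statement syntax used by the patented system is not given in this record.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ControlStatement:
    """Hypothetical control statement: a selector (tag + class) and an extractor."""
    field: str
    tag: str
    css_class: str
    extractor: Callable[[str], Optional[str]]


class _Collector(HTMLParser):
    """Collects the text content of every element matching (tag, css_class)."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.matches: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:  # nested element with the same tag name
                self.depth += 1
        elif tag == self.tag and self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


def apply_statements(html: str, statements: List[ControlStatement]) -> Dict[str, Optional[str]]:
    """Run each statement: select the first matching element, then extract a value."""
    record: Dict[str, Optional[str]] = {}
    for st in statements:
        collector = _Collector(st.tag, st.css_class)
        collector.feed(html)
        text = collector.matches[0] if collector.matches else None
        record[st.field] = st.extractor(text) if text is not None else None
    return record


def detect_changes(old: Dict[str, Optional[str]],
                   new: Dict[str, Optional[str]]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {field: (old_value, new_value)} for every field whose value differs."""
    fields = sorted(old.keys() | new.keys())
    return {f: (old.get(f), new.get(f)) for f in fields if old.get(f) != new.get(f)}


def extract_price(text: str) -> Optional[str]:
    """Extractor: pull the first decimal number out of the selected text."""
    m = re.search(r"\d+(?:\.\d+)?", text)
    return m.group(0) if m else None


# Assumed sample data: two crawls of the same made-up product page.
STATEMENTS = [
    ControlStatement("title", "h1", "title", str.strip),
    ControlStatement("price", "span", "price", extract_price),
]
PAGE_V1 = '<div><h1 class="title"> Blue Kettle </h1><span class="price">$24.99</span></div>'
PAGE_V2 = '<div><h1 class="title"> Blue Kettle </h1><span class="price">$19.99</span></div>'

if __name__ == "__main__":
    first = apply_statements(PAGE_V1, STATEMENTS)
    second = apply_statements(PAGE_V2, STATEMENTS)
    print(first)
    print(detect_changes(first, second))
```

In this sketch a field that comes back as None on a later crawl would also surface in detect_changes, which is one plausible signal that the site layout changed and the selector needs updating; how the patented system actually detects and repairs such cases is not stated in the summary.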