Automated specification extraction for consolidated product catalogue

Bibliographic Details
Published in: 2014 IEEE Students' Conference on Electrical, Electronics and Computer Science, pp. 1-7
Main Authors: Hareendran, Stuthi; Parashar, Anuvrat; Khan, Farhat Ullah
Format: Conference Proceeding
Language: English
Published: IEEE, 01.03.2014
Summary: This paper develops and implements a methodology for extracting product specifications from HTML product pages on various e-commerce portals. The extracted data must be in a standardised, uniform format that does not reflect the structure of its source. The most significant design problem is the source of the data itself: because the data is fetched from many different portals, its format and structure vary from portal to portal. The paper considers two subproblems, data available in structured format and data available in unstructured format. For structured data, the methodology performs extraction using the information pattern contained in the underlying tree structure of the page's HTML content. For unstructured data, it uses pattern matching with regular expressions. The implementation uses Python, with tools such as Scrapy and LXML.
ISBN: 9781479925254; 147992525X
DOI: 10.1109/SCEECS.2014.6804527
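
The two subproblems named in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the paper reports using Scrapy and LXML, whereas this sketch uses only the Python standard library (html.parser and re) so that it runs without extra packages. The names SpecTableParser, extract_structured, extract_unstructured and UNSTRUCTURED_PATTERNS, the assumption that specifications appear as two-cell table rows, and the two regular expressions are all hypothetical choices made for illustration.

```python
import re
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect the cell texts of every table row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so values are uniform.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def normalise_key(label):
    """Map a portal-specific label to a uniform key, e.g. 'Screen Size' -> 'screen_size'."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def extract_structured(html):
    """Structured case: assume each two-cell row is a (label, value) pair."""
    parser = SpecTableParser()
    parser.feed(html)
    parser.close()
    return {normalise_key(r[0]): r[1] for r in parser.rows if len(r) == 2 and r[0]}


# Unstructured case: illustrative patterns only, not taken from the paper.
UNSTRUCTURED_PATTERNS = {
    "ram": re.compile(r"(\d+(?:\.\d+)?)\s*GB\s+RAM", re.IGNORECASE),
    "screen_size": re.compile(r"(\d+(?:\.\d+)?)\s*(?:-\s*)?inch", re.IGNORECASE),
}


def extract_unstructured(text):
    """Return the first match of each pattern found in free text."""
    specs = {}
    for key, pattern in UNSTRUCTURED_PATTERNS.items():
        match = pattern.search(text)
        if match:
            specs[key] = match.group(1)
    return specs
```

Both functions return a plain dictionary with normalised keys, so the output format is the same regardless of which portal or which extraction path produced it. A real portal would likely need per-site row selectors and a larger pattern set.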