Automated specification extraction for consolidated product catalogue
Published in | 2014 IEEE Students' Conference on Electrical, Electronics and Computer Science pp. 1 - 7 |
---|---|
Main Authors | , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.03.2014 |
Subjects | |
Summary: | This paper develops and implements a methodology for extracting product specifications from HTML product-detail pages on various e-commerce portals. The extracted data must be in a standardised, uniform format that does not reflect the structure of its source. The most significant problem in designing a solution is the source of the data itself: because the data is fetched from many different portals, its format and structure vary from portal to portal. The paper considers two subproblems, data available in structured format and data available in unstructured format. For structured data, the methodology performs extraction using the information pattern contained in the underlying tree structure of the page's HTML content. For unstructured data, it uses pattern matching with regular expressions. The implementation uses Python with tools such as Scrapy and lxml. |
---|---|
ISBN: | 9781479925254 147992525X |
DOI: | 10.1109/SCEECS.2014.6804527 |
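
The two subproblems named in the summary can be illustrated with a minimal sketch. This is not the paper's implementation: it uses Python's standard-library `html.parser` as a stand-in for Scrapy and lxml so that it runs without dependencies, it assumes structured specifications appear as two-cell table rows, and the names `SpecTableParser`, `normalise_key`, `extract_structured`, `extract_unstructured` and the patterns in `PATTERNS` are hypothetical.

```python
import re
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell texts
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def normalise_key(label):
    """Map a portal-specific label to a uniform snake_case key."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def extract_structured(html):
    """Structured case: treat each two-cell row as a (label, value) pair."""
    parser = SpecTableParser()
    parser.feed(html)
    parser.close()
    return {normalise_key(r[0]): r[1] for r in parser.rows if len(r) == 2}


# Hypothetical patterns for two attributes; a real catalogue would need more.
PATTERNS = {
    "ram": re.compile(r"(\d+)\s*GB\s+RAM", re.IGNORECASE),
    "screen_size": re.compile(r"(\d+(?:\.\d+)?)\s*(?:-\s*)?inch", re.IGNORECASE),
}


def extract_unstructured(text):
    """Unstructured case: pull values out of free text with regular expressions."""
    specs = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            specs[key] = match.group(1)
    return specs
```

Both functions return a plain dictionary with normalised keys, so the output carries no trace of whether the source was a table or free text, which is the uniform-format requirement the summary describes.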