Content Code Blurring: A New Approach to Content Extraction

Bibliographic Details
Published in	2008 19th International Workshop on Database and Expert Systems Applications, pp. 29-33
Main Author	Gottron, T.
Format	Conference Proceeding
Language	English
Published	IEEE 01.09.2008
Subjects	Algorithm design and analysis; content code blurring; Content Extraction; Data mining; Feature extraction; HTML; main content detection; Manuals; Presses; web information retrieval; Web sites
Summary:	Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions, we show that for most documents content code blurring delivers the best results.
ISBN:	9780769532998 0769532993
ISSN:	1529-4188
DOI:	10.1109/DEXA.2008.43
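
The summary describes the method only in outline: find long, homogeneously formatted regions through an iterative process. The sketch below is one possible reading of that idea, not the implementation from the paper. It assumes a per-character vector that marks text as 1 and markup as 0, smooths the vector repeatedly, and keeps text whose smoothed value stays high. The function names, the three-point kernel, the iteration count, and the threshold are all illustrative assumptions that do not come from this record.

```python
def content_code_vector(html):
    """Return one value per character: 1.0 for text, 0.0 for markup.

    Assumption: anything between '<' and '>' counts as markup. Real HTML
    (comments, scripts, attribute values containing '>') needs a proper parser.
    """
    vector = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
        vector.append(0.0 if in_tag else 1.0)
        if ch == ">":
            in_tag = False
    return vector


def blur(vector, kernel=(0.25, 0.5, 0.25)):
    """Apply one smoothing pass, repeating the boundary values at the edges."""
    last = len(vector) - 1
    half = len(kernel) // 2
    return [
        sum(w * vector[min(max(i + k - half, 0), last)] for k, w in enumerate(kernel))
        for i in range(len(vector))
    ]


def extract_main_content(html, iterations=50, threshold=0.75):
    """Keep text characters whose blurred content ratio reaches the threshold.

    iterations and threshold are hypothetical defaults chosen for illustration.
    """
    original = content_code_vector(html)
    blurred = original
    for _ in range(iterations):
        blurred = blur(blurred)
    return "".join(
        ch
        for ch, is_text, ratio in zip(html, original, blurred)
        if is_text and ratio >= threshold
    )


if __name__ == "__main__":
    sample = (
        '<html><body><ul><li><a href="/">Home</a></li>'
        '<li><a href="/about">About</a></li></ul>'
        "<p>Content extraction keeps the long article text of a page "
        "and drops the menus and banners around it.</p></body></html>"
    )
    print(extract_main_content(sample))
```

Short link labels surrounded by markup end up with a low smoothed value and are dropped, while the interior of a long paragraph stays close to 1. With a plain threshold like this one, a few characters at the start and end of each retained region are also lost, so a practical version would likely snap the result to block or word boundaries.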