Webpage Genre Identification Using Variable-Length Character n-Grams
Published in | 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007) Vol. 2; pp. 3-10 |
---|---|
Main Authors | , |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.10.2007 |
Summary: | An important factor for discriminating between Web pages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers, etc.). Web page genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of Web pages, including word and HTML tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present Web page genre identification experiments and improve the best reported results in both cases. |
---|---|
ISBN: | 076953015X 9780769530154 |
ISSN: | 1082-3409 2375-0197 |
DOI: | 10.1109/ICTAI.2007.107 |
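
The representation described in the summary can be illustrated with a minimal Python sketch. It is not the authors' method: the summary does not say which existing approach selects the variable-length n-grams, so this sketch simply counts all character n-grams of lengths 3 to 5 and keeps the most frequent ones. The function names, the length range, the cut-offs, and the regex-based tag handling are all assumptions made for illustration.

```python
import re
from collections import Counter


def char_ngrams(text, min_n=3, max_n=5):
    """Count every character n-gram of length min_n..max_n (assumed range)."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def tag_counts(html):
    """Count opening HTML tag names with a simple regex, not a full parser."""
    names = re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)
    return Counter(name.lower() for name in names)


def page_features(html, top_ngrams=1000, top_tags=20):
    """Combine frequent character n-grams with frequent HTML tags.

    Hypothetical feature layout: keys prefixed "ng:" hold n-gram counts
    from the visible text, keys prefixed "tag:" hold tag counts.
    """
    # Replace each tag with a space so n-grams come from the text content.
    text = re.sub(r"<[^>]+>", " ", html)
    features = {}
    for gram, count in char_ngrams(text).most_common(top_ngrams):
        features["ng:" + gram] = count
    for name, count in tag_counts(html).most_common(top_tags):
        features["tag:" + name] = count
    return features
```

Calling `page_features` on the raw HTML of a page yields a single dictionary that a standard classifier could consume after vectorization. Frequency-based selection is used here only as a stand-in for the paper's unspecified n-gram selection step.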