Importance of HTML structural elements and metadata in automated subject classification

Bibliographic Details
Published in: Research and advanced technology for digital libraries / Lecture Notes in Computer Science Vol. 3652; p. 368
Main Authors: Golub, Koraljka; Ardö, Anders
Format: Conference Proceeding, Book Chapter
Language: English
Published: 2005
ISBN: 9783540287674, 3540287671
ISSN: 1611-3349, 0302-9743
DOI: 10.1007/3-540-45747-X

Summary: The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection that was used comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several different methods: (total and partial) precision and recall, semantic distance and multiple regression. It was shown that for best results all the elements have to be included in the classification process. The exact way of combining the significance indicators turned out not to be overly important: using the F1 measure, the best combination of significance indicators yielded no more than 3% higher performance results than the baseline.