URL-Based Web Page Classification: With n-Gram Language Models

Bibliographic Details
Published in: Knowledge Discovery, Knowledge Engineering and Knowledge Management, pp. 19-33
Main Authors: Abdallah, Tarek Amr; de la Iglesia, Beatriz
Format: Book Chapter
Language: English
Published: Cham: Springer International Publishing, 2015
Series: Communications in Computer and Information Science
ISBN: 3319258397; 9783319258393
ISSN: 1865-0929; 1865-0937
DOI: 10.1007/978-3-319-25840-9_2

Summary: There are some situations these days in which it is important to have an efficient and reliable classification of a web-page from the information contained in the Uniform Resource Locator (URL) only, without the need to visit the page itself. For example, a social media website may need to quickly identify status updates linking to malicious websites to block them. The URL is very concise, and may be composed of concatenated words, so classification with only this information is a very challenging task. Methods proposed for this task, for example, the all-grams approach which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets. We have recently proposed a new method for URL-based web page classification. We have introduced an n-gram language model for this task as a method that provides competitive accuracy and scalability to larger datasets. Our method allows for the classification of new URLs with unseen sub-sequences. In this paper we extend our presentation and include additional results to validate the proposed approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that our method is competitive in terms of accuracy with the best known methods but also scales well for larger datasets.
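
The record above only summarises the approach. As a rough illustration of the general idea, and not the chapter's actual implementation, the sketch below builds per-class character n-gram counts from URLs and scores a new URL with add-one smoothing, so sub-sequences unseen in training still receive a non-zero probability. It is a simplified bag-of-n-grams variant rather than a full n-gram language model, and the class names, URLs and the choice of n are assumptions made for the example.

# Minimal illustrative sketch (assumed details, not the chapter's implementation):
# score a URL under per-class character n-gram counts with add-one smoothing,
# so URLs containing unseen sub-sequences still get a non-zero probability.
from collections import defaultdict
import math


def char_ngrams(url, n=3):
    # e.g. 'abcde' with n=3 -> ['abc', 'bcd', 'cde']
    return [url[i:i + n] for i in range(len(url) - n + 1)]


class NGramURLClassifier:
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))  # class -> n-gram -> count
        self.totals = defaultdict(int)                       # class -> total n-grams
        self.vocab = set()                                    # all n-grams seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            for gram in char_ngrams(url, self.n):
                self.counts[label][gram] += 1
                self.totals[label] += 1
                self.vocab.add(gram)

    def _log_prob(self, gram, label):
        # Add-one (Laplace) smoothing: unseen n-grams get a small, non-zero probability.
        count = self.counts[label].get(gram, 0)
        return math.log((count + 1) / (self.totals[label] + len(self.vocab) + 1))

    def predict(self, url):
        grams = char_ngrams(url, self.n)
        scores = {label: sum(self._log_prob(g, label) for g in grams)
                  for label in self.counts}
        return max(scores, key=scores.get)


# Toy usage with made-up URLs and labels:
clf = NGramURLClassifier(n=3)
clf.fit(["news.example.com/politics/story", "sport.example.com/football/match",
         "news.example.com/world/report", "sport.example.com/tennis/score"],
        ["news", "sport", "news", "sport"])
print(clf.predict("example.com/cricket/score"))  # -> 'sport'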