URL-Based Web Page Classification: With n-Gram Language Models

Bibliographic Details
Published in: Knowledge Discovery, Knowledge Engineering and Knowledge Management, pp. 19-33
Main Authors: Abdallah, Tarek Amr; de la Iglesia, Beatriz
Format: Book Chapter
Language: English
Published: Cham: Springer International Publishing, 2015
Series: Communications in Computer and Information Science
ISBN: 3319258397; 9783319258393
ISSN: 1865-0929; 1865-0937
DOI: 10.1007/978-3-319-25840-9_2

Summary: There are some situations these days in which it is important to have an efficient and reliable classification of a web-page from the information contained in the Uniform Resource Locator (URL) only, without the need to visit the page itself. For example, a social media website may need to quickly identify status updates linking to malicious websites to block them. The URL is very concise, and may be composed of concatenated words, so classification with only this information is a very challenging task. Methods proposed for this task, for example, the all-grams approach which extracts all possible sub-strings as features, provide reasonable accuracy but do not scale well to large datasets. We have recently proposed a new method for URL-based web page classification. We have introduced an n-gram language model for this task as a method that provides competitive accuracy and scalability to larger datasets. Our method allows for the classification of new URLs with unseen sub-sequences. In this paper we extend our presentation and include additional results to validate the proposed approach. We explain the parameters associated with the n-gram language model and test their impact on the models produced. Our results show that our method is competitive in terms of accuracy with the best known methods but also scales well for larger datasets.
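
The record above only summarises the approach. As a rough illustration of the general idea, and not the chapter's actual implementation, the sketch below builds per-class character n-gram counts from URLs and scores a new URL with add-one smoothing, so sub-sequences unseen in training still receive a non-zero probability. It is a simplified bag-of-n-grams variant rather than a full n-gram language model, and the class names, URLs and the choice of n are assumptions made for the example.

# Minimal illustrative sketch (assumed details, not the chapter's implementation):
# score a URL under per-class character n-gram counts with add-one smoothing,
# so URLs containing unseen sub-sequences still get a non-zero probability.
from collections import defaultdict
import math


def char_ngrams(url, n=3):
    # e.g. 'abcde' with n=3 -> ['abc', 'bcd', 'cde']
    return [url[i:i + n] for i in range(len(url) - n + 1)]


class NGramURLClassifier:
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))  # class -> n-gram -> count
        self.totals = defaultdict(int)                       # class -> total n-grams
        self.vocab = set()                                    # all n-grams seen in training

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            for gram in char_ngrams(url, self.n):
                self.counts[label][gram] += 1
                self.totals[label] += 1
                self.vocab.add(gram)

    def _log_prob(self, gram, label):
        # Add-one (Laplace) smoothing: unseen n-grams get a small, non-zero probability.
        count = self.counts[label].get(gram, 0)
        return math.log((count + 1) / (self.totals[label] + len(self.vocab) + 1))

    def predict(self, url):
        grams = char_ngrams(url, self.n)
        scores = {label: sum(self._log_prob(g, label) for g in grams)
                  for label in self.counts}
        return max(scores, key=scores.get)


# Toy usage with made-up URLs and labels:
clf = NGramURLClassifier(n=3)
clf.fit(["news.example.com/politics/story", "sport.example.com/football/match",
         "news.example.com/world/report", "sport.example.com/tennis/score"],
        ["news", "sport", "news", "sport"])
print(clf.predict("example.com/cricket/score"))  # -> 'sport'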