Efficient partitioning strategies for distributed web crawling

Bibliographic Details
Published in Information Networking. Towards Ubiquitous Networking and Services, pp. 544-553
Main Authors Exposto, José, Macedo, Joaquim, Pina, António Manuel Silva, Alves, Albano Agostinho Gomes, Rufino, José
Format Conference Proceeding Book Chapter
Language English
Published Berlin, Heidelberg Springer 2008
Series Lecture Notes in Computer Science

Summary: This paper presents a multi-objective approach to Web space partitioning aimed at improving distributed crawling efficiency. The investigation is supported by the construction of two weighted graphs. The first models the topological communication infrastructure between crawlers and Web servers; the second represents the number of link connections between the servers' pages. The edge weights represent, respectively, computed RTTs and page links between nodes. The two graphs are then combined, using a multi-objective partitioning algorithm, to support Web space partitioning and load allocation for an adaptable number of geographically distributed crawlers. Partitioning strategies were evaluated by varying the number of partitions (crawlers) to obtain merit figures for: i) download time, ii) exchange time and iii) relocation time. The evaluation showed that our partitioning schemes outperform traditional hostname-hash-based counterparts in all evaluated metrics, achieving on average an 18% reduction in download time, a 78% reduction in exchange time and a 46% reduction in relocation time.
Fundação para a Ciência e a Tecnologia (FCT)
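To make the contrast in the summary concrete, the following is a minimal sketch that compares a hostname-hash baseline with a toy greedy heuristic that weighs crawler-to-server RTT against inter-server link counts. It is not the multi-objective partitioning algorithm used in the paper, which the summary does not specify. The function names, the `alpha` trade-off parameter, the normalization, and all host names and weights are assumptions made purely for illustration.

```python
import zlib


def hash_partition(hosts, k):
    """Baseline: assign each host to crawler crc32(hostname) mod k."""
    return {h: zlib.crc32(h.encode("utf-8")) % k for h in hosts}


def greedy_partition(hosts, rtt, links, k, alpha=0.5):
    """Toy greedy heuristic (an assumption, NOT the paper's algorithm).

    rtt[(crawler, host)] is an RTT in ms (first graph);
    links[(a, b)] is the number of page links between hosts a and b
    (second graph). alpha trades RTT against links cut.
    """
    max_rtt = max(rtt.values()) or 1.0
    max_links = max(links.values(), default=0) or 1.0
    assignment = {}
    for h in hosts:
        best, best_cost = None, None
        for c in range(k):
            # Links from h to hosts already placed on a different crawler.
            cut = sum(
                w for (a, b), w in links.items()
                if (a == h and assignment.get(b, c) != c)
                or (b == h and assignment.get(a, c) != c)
            )
            cost = alpha * rtt[(c, h)] / max_rtt + (1 - alpha) * cut / max_links
            if best_cost is None or cost < best_cost:
                best, best_cost = c, cost
        assignment[h] = best
    return assignment


def total_rtt(assignment, rtt):
    """Sum of RTTs between each host and its crawler (download-cost proxy)."""
    return sum(rtt[(c, h)] for h, c in assignment.items())


def cut_links(assignment, links):
    """Links whose endpoints sit on different crawlers (exchange-cost proxy)."""
    return sum(w for (a, b), w in links.items() if assignment[a] != assignment[b])


# Hypothetical toy data: two crawlers (0 and 1) and four hosts.
hosts = ["a.example", "b.example", "c.example", "d.example"]
rtt = {
    (0, "a.example"): 10, (1, "a.example"): 100,
    (0, "b.example"): 20, (1, "b.example"): 90,
    (0, "c.example"): 100, (1, "c.example"): 10,
    (0, "d.example"): 80, (1, "d.example"): 30,
}
links = {
    ("a.example", "b.example"): 50,
    ("c.example", "d.example"): 40,
    ("b.example", "c.example"): 5,
}

baseline = hash_partition(hosts, 2)
greedy = greedy_partition(hosts, rtt, links, 2)
print("hash:  ", baseline, total_rtt(baseline, rtt), cut_links(baseline, links))
print("greedy:", greedy, total_rtt(greedy, rtt), cut_links(greedy, links))
```

The two proxy metrics correspond only loosely to the download and exchange times named in the summary; relocation time is not modeled here.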
ISBN: 9783540895237, 354089523X
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-540-89524-4_54