A Hadoop-Based WAN Distributed Topic Crawler System Framework


Bibliographic Details
Published in 计算机工程与科学 (Computer Engineering and Science) Vol. 37; no. 4; pp. 670-675
Main Author 王淑芬 (WANG Shu-fen), 高军礼 (GAO Jun-li), 邹普 (ZOU Pu), 宋海涛 (SONG Hai-tao)
Format Journal Article
Language Chinese
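Published School of Automation, Guangdong University of Technology, Guangzhou, Guangdong 510006; School of Business Administration, South China University of Technology, Guangzhou, Guangdong 510641 2015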
ISSN 1007-130X
DOI 10.3969/j.issn.1007-130X.2015.04.008

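Summary: WAN-based distributed crawlers have many advantages over LAN-based crawlers, yet existing Hadoop-based distributed crawler designs mainly target LAN environments. To address the problem that the Hadoop distributed computing platform is poorly suited to WAN deployment, a Hadoop-based WAN distributed crawler system framework is designed. The crawler system uses message-oriented middleware for reliable distributed communication, stores data in the scalable Hadoop Distributed File System (HDFS), parses web pages in parallel with MapReduce, and makes the framework customizable through template matching. A performance simulation of the system shows that the framework can support large-scale concurrent crawling.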
Bibliography: WAN-based distributed crawler; Hadoop; crawling system framework; template matching; topic crawler
WANG Shu-fen, GAO Jun-li, ZOU Pu, SONG Hai-tao (1. School of Automation, Guangdong University of Technology, Guangzhou 510006; 2. School of Business Administration, South China University of Technology, Guangzhou 510641, China)
43-1258/TP
Compared with LAN crawling systems, WAN distributed crawling systems have many advantages; however, existing Hadoop-based crawling systems are mostly used in LANs. To achieve a high computing speed for Hadoop in a WAN, we present a crawler framework based on Hadoop. For extensible storage, all data are stored on the Hadoop Distributed File System, and web pages are analyzed in parallel through MapReduce. For reliable communication, message-oriented middleware is used. To make the framework customizable, a template matching method is proposed. The performance simulation shows that the crawler framework can support large-scale crawling work.