A Hadoop-Based WAN Distributed Topic Crawler System Framework

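Compared with LAN crawlers, WAN distributed crawlers have many advantages, but existing Hadoop-based distributed crawlers are designed mainly for LAN environments. To address the problem that the Hadoop distributed computing platform is not suited to WAN deployment, a Hadoop-based WAN distributed crawler system framework is designed. The crawler system uses message-oriented middleware for reliable distributed communication, stores data in the scalable Hadoop Distributed File System (HDFS), parses web pages in parallel with MapReduce, and makes the framework customizable through template matching. Performance simulation of the system shows that the framework can support large-scale concurrent crawling. ...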
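The summary mentions parsing pages with MapReduce and customizing the framework through template matching, but this record gives no implementation details. The following is only a minimal sketch of how those two ideas could fit together, not the authors' method: plain Python functions stand in for Hadoop map and reduce tasks, and the `TEMPLATES` table, the host `example.com`, and the names `map_page`, `reduce_records`, and `run` are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical per-site templates: field name -> regex with one capture group.
# A real deployment would load these from configuration; this is an assumption.
TEMPLATES = {
    "example.com": {
        "title": re.compile(r"<h1>(.*?)</h1>", re.S),
        "body": re.compile(r'<div class="content">(.*?)</div>', re.S),
    },
}


def map_page(url, html):
    """Map step: emit (host, record) when a template exists for the page's host."""
    host = urlparse(url).netloc
    template = TEMPLATES.get(host)
    if template is None:
        return  # no template for this site, so the page is skipped
    record = {"url": url}
    for field, pattern in template.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    yield host, record


def reduce_records(pairs):
    """Reduce step: group the extracted records by host."""
    grouped = defaultdict(list)
    for host, record in pairs:
        grouped[host].append(record)
    return dict(grouped)


def run(pages):
    """Drive the map and reduce steps in-process over (url, html) pairs."""
    pairs = [pair for url, html in pages for pair in map_page(url, html)]
    return reduce_records(pairs)
```

Under these assumptions, supporting a new site means adding an entry to the template table rather than changing the parsing code, which is one plausible reading of "customizable through template matching".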

Bibliographic Details
Published in	Computer Engineering & Science (计算机工程与科学) Vol. 37; no. 4; pp. 670-675
Main Author	WANG Shu-fen (王淑芬); GAO Jun-li (高军礼); ZOU Pu (邹普); SONG Hai-tao (宋海涛)
Format	Journal Article
Language	Chinese
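Published	School of Automation, Guangdong University of Technology, Guangzhou, Guangdong 510006; School of Business Administration, South China University of Technology, Guangzhou, Guangdong 510641; 2015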
Subjects	Hadoop; WAN-based distributed crawler; topic crawler; crawling system framework; template matching
ISSN	1007-130X
DOI	10.3969/j.issn.1007-130X.2015.04.008

Bibliography:	WAN-based distributed crawler; Hadoop; crawling system framework; template matching; topic crawler. WANG Shu-fen, GAO Jun-li, ZOU Pu, SONG Hai-tao (1. School of Automation, Guangdong University of Technology, Guangzhou 510006; 2. School of Business Administration, South China University of Technology, Guangzhou 510641, China). 43-1258/TP. Compared with LAN crawling systems, WAN distributed crawling systems have many advantages; however, existing Hadoop-based crawling systems are mostly used in LANs. To achieve high computing speed with Hadoop in a WAN, we present a crawler framework based on Hadoop. For extensible storage, all data are stored on the Hadoop Distributed File System, and web pages are analyzed in parallel through MapReduce. For reliable communication, message-oriented middleware is used. To make the framework customizable, a template matching method is proposed. The performance simulation shows that the crawler framework can support large-scale crawling work.