Which web crawler for extracting and parsing data

Published 2019-03-16 19:17

I'm trying to crawl about a thousand websites, and I'm interested only in the HTML content.

Then I transform the HTML into XML so it can be parsed with XPath to extract the specific content I'm interested in.
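For context, this is roughly the extraction step I mean. Here is a minimal sketch in Python using lxml's HTML parser, which lets you query the parsed tree with XPath directly; the URL and the XPath expression are placeholders, not my actual targets:

```python
# Minimal sketch of the HTML -> XPath extraction step using lxml.
# The URL and the XPath expression below are placeholders for illustration.
import requests
from lxml import html

response = requests.get("https://example.com/page")   # fetch the raw HTML
tree = html.fromstring(response.content)               # lenient HTML parse into an element tree

# Extract the text of every headline matching a hypothetical selector.
titles = tree.xpath("//h1[@class='article-title']/text()")
for title in titles:
    print(title.strip())
```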

I've been using the Heritrix 2.0 crawler for a few months, but I ran into huge performance, memory, and stability problems (Heritrix crashes about every day, and no attempts to limit memory usage via JVM parameters were successful).

From your experience in the field, which crawler would you use for extracting and parsing content from a thousand sources?

Tags: web-crawler
3 answers
虎瘦雄心在
#2 · 2019-03-16 19:36

Wow. State-of-the-art crawlers like the ones search engines use can crawl and index a million URLs a day on a single box. Sure, the HTML-to-XML rendering step takes a bit, but I agree with you on the performance. I've only used private crawlers, so I can't recommend one you'll be able to use, but I hope these performance numbers help in your evaluation.

爷的心禁止访问
#3 · 2019-03-16 19:53

I would not use the 2.x branch (which has been discontinued) or the 3.x branch (currently in development) for any 'serious' crawling unless you want to help improve Heritrix or just like being on the bleeding edge.

Heritrix 1.14.3 is the most recent stable release, and it really is stable, used by many institutions for both small- and large-scale crawling. I'm using it to run crawls against tens of thousands of domains, collecting tens of millions of URLs in under a week.

The 3.x branch is getting closer to a stable release, but even then I'd hold off a bit before general use, until deployments at the Internet Archive and elsewhere have improved its performance and stability.

Update: Since someone up-voted this recently I feel it is worth noting that Heritrix 3.x is now stable and is the recommended version for those starting out with Heritrix.

等我变得足够好
#4 · 2019-03-16 19:55

I would suggest writing your own in Python with Scrapy and either lxml or BeautifulSoup. You should find a few good tutorials for them on Google. I use Scrapy+lxml at work to spider ~600 websites, checking for broken links. A minimal sketch of such a spider is below.
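This is only a sketch under assumed placeholders: the spider name, the start URLs, and the XPath selectors are invented for illustration and would depend on the sites being crawled.

```python
# Minimal Scrapy spider sketch: crawl a list of sites and yield whatever
# the XPath selectors match. Domains and selectors are placeholders.
import scrapy


class ContentSpider(scrapy.Spider):
    name = "content"
    # In practice the ~1000 start URLs would be loaded from a file or database.
    start_urls = [
        "https://example.com/",
        "https://example.org/",
    ]

    def parse(self, response):
        # Scrapy responses support XPath directly, so there is no separate
        # HTML-to-XML conversion step.
        for node in response.xpath("//article"):
            yield {
                "url": response.url,
                "title": node.xpath(".//h1/text()").get(),
                "body": " ".join(node.xpath(".//p/text()").getall()),
            }
```

You can run something like this with `scrapy runspider spider.py -o items.json` and let Scrapy handle scheduling, politeness delays, and retries.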
