I'm looking at crawling thousands of pages and need a solution. Every site has its own HTML markup - they are all unique sites, and no clean data feed or API is available. I'm hoping to load the captured data into some sort of DB.
Any ideas on how to do this with scrapy if possible?
If I had to scrape clean data from thousands of sites, each with its own layout, structure, etc., I would implement (and actually have done so in some projects) the following approach:
This goes way beyond building a scrapy scraper, of course, and requires deep knowledge of and experience with NLP and possibly machine learning.
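Whatever the extraction logic ends up looking like, the crawling and storage side can still be a fairly ordinary scrapy project. Below is a minimal sketch of that part only, assuming you already have a list of start URLs and deferring the per-page extraction to a hypothetical `extract_fields()` helper; the spider name, the example URL, the SQLite schema, and the field names are all illustrative, not part of the approach described above.

```python
# Minimal crawl-and-store sketch. The real work (layout-agnostic field
# extraction via NLP/ML) would replace extract_fields() below.
import sqlite3

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def extract_fields(response):
    """Hypothetical placeholder for the NLP/ML extraction step."""
    return {
        "url": response.url,
        "title": (response.css("title::text").get() or "").strip(),
    }


class GenericSpider(CrawlSpider):
    name = "generic"
    # Replace with your thousands of sites; in practice also set
    # allowed_domains and/or DEPTH_LIMIT so the crawl stays bounded.
    start_urls = ["https://example.com"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield extract_fields(response)


class SQLitePipeline:
    """Writes one row per scraped page into a local SQLite database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("pages.db")
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute("INSERT INTO pages VALUES (?, ?)", (item["url"], item["title"]))
        return item


if __name__ == "__main__":
    process = CrawlerProcess(settings={"ITEM_PIPELINES": {"__main__.SQLitePipeline": 300}})
    process.crawl(GenericSpider)
    process.start()
```

Running the script crawls outward from the start URLs and stores a row per page in `pages.db`; you would swap `extract_fields()` for your real extraction logic and point the pipeline at whatever database you actually use.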
Also, you can't expect to get anywhere close to 100% accurate results from such an approach. Depending on how the algorithms are adjusted and trained, such a system will either skip some of the valid data (false negatives) or pick up data where there actually isn't any (false positives), or, more likely, a mix of both.
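To make that trade-off concrete, the usual way to measure it is precision and recall over a hand-labelled sample of pages. The tiny helper below is purely illustrative, with made-up numbers, and is not part of the approach itself:

```python
# Illustration only: given counts from a hand-labelled sample of pages,
# precision falls as false positives grow and recall falls as false
# negatives grow; tuning the extractor trades one against the other.
def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)  # share of extracted values that are real
    recall = true_pos / (true_pos + false_neg)     # share of real values that were extracted
    return precision, recall


# e.g. 80 correct extractions, 10 spurious ones, 20 missed ones
print(precision_recall(80, 10, 20))  # -> (0.888..., 0.8)
```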
Nonetheless, I hope my answer helps you get a good picture of what's involved.