Scrapy crawlers not running simultaneously from Python script

Posted 2019-09-02 07:32

I was just wondering why this might be occurring. Here is my Python script to run all:

from scrapy import cmdline

file = open('cityNames.txt', 'r')
cityNames = file.read().splitlines()

for city in cityNames:
    url = "http://" + city + ".website.com"
    output = city + ".json"

    cmdline.execute(['scrapy', 'crawl', 'backpage_tester', '-a', "start_url="+url, '-o', ""+output])

cityNames.txt:

chicago
sanfran
boston

It runs through the first city fine, but then stops after that. It doesn't run sanfran or boston - only chicago. Any thoughts? Thank you!

1 Answer
来,给爷笑一个 · 2019-09-02 08:17

Your loop stops because cmdline.execute never returns: it runs the crawl and then exits the Python process via sys.exit, so the loop body only executes for the first city. Either run the crawls asynchronously inside a single Python process with Scrapy's own crawler API, or drive one scrapy process per url from the shell.
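
Here is a minimal sketch of the Python route using scrapy.crawler.CrawlerProcess, which schedules all the crawls on one Twisted reactor and runs them concurrently. It assumes the script runs inside your Scrapy project (so the spider name resolves) and a Scrapy version with the FEEDS setting (2.0+); the %(city)s output template relies on Scrapy filling feed-URI parameters from spider attributes:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# One output file per city: %(city)s is filled in from the spider's
# `city` attribute, which the keyword argument below sets.
settings.set('FEEDS', {'%(city)s.json': {'format': 'json'}})

process = CrawlerProcess(settings)

with open('cityNames.txt') as f:
    cityNames = f.read().splitlines()

for city in cityNames:
    # crawl() only schedules the spider; nothing runs yet
    process.crawl('backpage_tester',
                  start_url="http://" + city + ".website.com",
                  city=city)

process.start()  # starts the reactor and blocks until every crawl finishes

Or use a bash script that iterates over a text file of your urls: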

cat urls.txt | xargs -I{} scrapy crawl spider_name -a start_url={}

This issues one scrapy process per url, one after another. However, be warned: this could easily overload your system if those crawls are extensive and deep on each site and your spiders are not properly configured.
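
If the goal is crawls that truly run simultaneously, GNU xargs can launch them in parallel; the -P flag caps how many processes run at once (the 4 below is an arbitrary limit, not anything from the command above):

cat urls.txt | xargs -I{} -P 4 scrapy crawl spider_name -a start_url={}

Keeping the cap small limits the overload risk just mentioned.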
