Python Scrapy - populate start_urls from mysql

Posted 2019-03-09 18:05

I am trying to populate start_urls with a SELECT from a MySQL table in spider.py. When I run "scrapy runspider spider.py" I get no output; the run just finishes with no errors.

I have tested the SELECT query in a standalone Python script, and there start_urls is populated with the entries from the MySQL table.

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import MySQLdb


class ProductsSpider(BaseSpider):
    name = "Products"
    allowed_domains = ["test.com"]
    start_urls = []

    def parse(self, response):
        print self.start_urls

    def populate_start_urls(self, url):
        conn = MySQLdb.connect(
                user='user',
                passwd='password',
                db='scrapy',
                host='localhost',
                charset="utf8",
                use_unicode=True
                )
        cursor = conn.cursor()
        cursor.execute(
            'SELECT url FROM links;'
            )
        rows = cursor.fetchall()

        for row in rows:
            start_urls.append(row[0])
        conn.close()

Tags: python mysql scrapy web-crawler

2 Answers

forever°为你锁心

#2 · 2019-03-09 18:36

Populate start_urls in the spider's __init__ method:

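def __init__(self, *args, **kwargs):
    # Forward any spider arguments Scrapy passes in to the base class.
    super(ProductsSpider, self).__init__(*args, **kwargs)
    self.start_urls = get_start_urls()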

This assumes get_start_urls() is a helper of your own that queries the database and returns a list of URLs.
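
The answer leaves get_start_urls() undefined, so here is a minimal sketch of what such a helper might look like. The body and the conn parameter are my assumptions, not part of the answer. It uses the standard-library sqlite3 module as a stand-in so it runs without a MySQL server; since MySQLdb follows the same DB-API, passing the MySQLdb.connect(...) connection from the question should work the same way.

```python
import sqlite3


def get_start_urls(conn, query="SELECT url FROM links"):
    """Return every URL produced by `query` as a plain list of strings.

    `conn` is any DB-API connection (hypothetical parameter; the answer's
    helper takes no arguments).
    """
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        # Each row is a 1-tuple such as ('http://test.com/a',); keep column 0.
        return [row[0] for row in cursor.fetchall()]
    finally:
        cursor.close()


if __name__ == "__main__":
    # Stand-in database (assumption): an in-memory SQLite table named
    # `links`, mirroring the table in the question.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (url TEXT)")
    conn.executemany(
        "INSERT INTO links (url) VALUES (?)",
        [("http://test.com/a",), ("http://test.com/b",)],
    )
    print(get_start_urls(conn))
    conn.close()
```

In the spider's __init__ you would open the connection, assign the returned list to self.start_urls, and close the connection again.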


我欲成王，谁敢阻挡

#3 · 2019-03-09 18:59

A better approach is to override the start_requests method.

This can query your database, much like populate_start_urls, and return a sequence of Request objects.

You would just need to rename your populate_start_urls method to start_requests, drop its unused url parameter (Scrapy calls start_requests with no arguments), and change the final loop as follows:

for row in rows:
    yield self.make_requests_from_url(row[0])

