Get scrapy result inside a Django view

Posted 2019-06-05 01:13

I'm scraping a page successfully and it returns a single item. I don't want to save the scraped item to the database or to a file; I need to get it inside a Django view.

My view is as follows:

from pydispatch import dispatcher
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

# MySpider is the project's spider; its import is omitted here, as in the original question.


def start_crawl(process_number, court):
    """Start the crawler.

    Args:
        process_number (str): Process number to be found.
        court (str): Court of the process.
    """
    runner = CrawlerRunner(get_project_settings())
    results = list()

    def crawler_results(sender, parse_result, **kwargs):
        results.append(parse_result)

    dispatcher.connect(crawler_results, signal=signals.item_passed)
    process_info = runner.crawl(MySpider, process_number=process_number, court=court)

    return results

I followed this solution, but the results list is always empty.

I also read about creating a custom spider middleware and collecting the results in its process_spider_output method, along the lines of the sketch below.
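
This is roughly what I understood that idea to look like (a minimal sketch; ResultCollectorMiddleware is a hypothetical name, not code from my project):

from scrapy import Request


class ResultCollectorMiddleware(object):
    """Spider middleware that records scraped items as they pass through."""

    def __init__(self):
        self.items = []

    def process_spider_output(self, response, result, spider):
        # A spider middleware must yield back everything it receives;
        # items are recorded on the way through, requests are left alone.
        for element in result:
            if not isinstance(element, Request):
                self.items.append(element)
            yield element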

How can I get the desired result?

Thanks!

Tags: django scrapy
2 Answers
Luminary・发光体 · 2019-06-05 01:45

If you really want to collect all the data in a "special" object, store the items in a dedicated pipeline, similar to the duplicates filter example at https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter, and hand them over to your Django object in close_spider (https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=close_spider#close_spider). A sketch of that idea follows.
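
A minimal sketch of that suggestion, assuming you can attach a callback to the spider (CollectorPipeline and on_results are hypothetical names, not Scrapy API):

class CollectorPipeline(object):
    """Collect every item and hand the batch over when the spider closes."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Hand the collected items to your Django-side code here, e.g. a
        # callback attached to the spider instance (hypothetical attribute).
        callback = getattr(spider, "on_results", None)
        if callback is not None:
            callback(self.items)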

乱世女痞 · 2019-06-05 02:04

I managed to implement something like that in one of my projects. It is a mini-project and I was looking for a quick solution, so you might need to modify it or add multi-threading support if you put it in a production environment.

Overview

I created an item pipeline that just adds the items to an InMemoryItemStore helper. Then, in my __main__ code, I wait for the crawler to finish and pop all the items out of the InMemoryItemStore. From there I can manipulate the items as I wish.

Code

items_store.py

Hacky in-memory store. It is not very elegant, but it got the job done for me. Modify and improve it if you wish. I've implemented it as a class with class methods so I can import it anywhere in the project and use it without passing an instance around.

class InMemoryItemStore(object):
    __ITEM_STORE = None

    @classmethod
    def pop_items(cls):
        """Return all stored items and reset the store."""
        items = cls.__ITEM_STORE or []
        cls.__ITEM_STORE = None
        return items

    @classmethod
    def add_item(cls, item):
        """Append an item, lazily creating the underlying list."""
        if not cls.__ITEM_STORE:
            cls.__ITEM_STORE = []
        cls.__ITEM_STORE.append(item)
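
A quick usage sketch (not part of the original answer) showing how the store behaves:

InMemoryItemStore.add_item({"title": "example"})
print(InMemoryItemStore.pop_items())  # [{'title': 'example'}]
print(InMemoryItemStore.pop_items())  # [] -- the store resets after each pop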

pipelines.py

This pipeline will store the objects in the in-memory store from the snippet above. All items are simply returned to keep the regular pipeline flow intact. If you don't want to pass some items down to the other pipelines, change process_item to raise DropItem for them instead of returning them (see the variant sketch after the snippet).

from <your-project>.items_store import InMemoryItemStore


class StoreInMemoryPipeline(object):
    """Add items to the in-memory item store."""
    def process_item(self, item, spider):
        InMemoryItemStore.add_item(item)
        return item
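
For example, here is a hedged variant that stores every item but filters some out of the downstream flow (the "keep" field is a hypothetical criterion):

from scrapy.exceptions import DropItem

from <your-project>.items_store import InMemoryItemStore


class StoreAndFilterPipeline(object):
    """Store every item, but only pass some of them downstream."""

    def process_item(self, item, spider):
        InMemoryItemStore.add_item(item)
        if not item.get("keep", True):  # hypothetical filtering criterion
            raise DropItem("filtered after storing in memory")
        return item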

settings.py

Now add the StoreInMemoryPipeline to the scraper settings. If you change the process_item method above, make sure you set the proper priority here; lower numbers run earlier, so adjust the 100 below accordingly.

ITEM_PIPELINES = {
   ...
   '<your-project-name>.pipelines.StoreInMemoryPipeline': 100,
   ...
}

main.py

This is where I tie everything together. I clear the in-memory store, run the crawler, and fetch all the items.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from <your-project>.items_store import InMemoryItemStore
from <your-project>.spiders.your_spider import YourSpider

def get_crawler_items(**kwargs):
    InMemoryItemStore.pop_items()  # make sure the store is empty before the crawl

    process = CrawlerProcess(get_project_settings())
    process.crawl(YourSpider, **kwargs)
    process.start()  # the script will block here until the crawling is finished
    process.stop()
    return InMemoryItemStore.pop_items()

if __name__ == "__main__":
    items = get_crawler_items()
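
Since the question asks for the items inside a Django view, here is a minimal sketch of how this could be wired up (my_view is a hypothetical name; note that CrawlerProcess starts the Twisted reactor, which can only run once per Python process, so a long-running Django server would typically call get_crawler_items in a subprocess):

from django.http import JsonResponse


def my_view(request):
    items = get_crawler_items(
        process_number=request.GET.get("process_number"),
        court=request.GET.get("court"),
    )
    # Scrapy items behave like dicts, so they can be serialized directly.
    return JsonResponse({"items": [dict(item) for item in items]})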