I'm scrapping a page successfully that returns me an unique item. I don't want neither to save the scrapped item in the database nor to a file. I need to get it inside a Django view.
My view is as follows:
def start_crawl(process_number, court):
"""
Starts the crawler.
Args:
process_number (str): Process number to be found.
court (str): Court of the process.
"""
runner = CrawlerRunner(get_project_settings())
results = list()
def crawler_results(sender, parse_result, **kwargs):
results.append(parse_result)
dispatcher.connect(crawler_results, signal=signals.item_passed)
process_info = runner.crawl(MySpider, process_number=process_number, court=court)
return results
I followed this solution but results list is always empty.
I read something as creating a custom middleware and getting the results at the process_spider_output method.
How can I get the desired result?
Thanks!
If you really want to collect all data in a "special" object. Store the data in a separate pipeline like https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter and in close_spider (https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=close_spider#close_spider) you open your django object.
I managed to implement something like that in one of my projects. It is a mini-project and I was looking for a quick solution. You'll might need modify it or support multi-threading etc in case you put it in production environment.
Overview
I created an
ItemPipeline
that just add the items into aInMemoryItemStore
helper. Then, in my__main__
code I wait for the crawler to finish, and pop all the items out of theInMemoryItemStore
. Then I can manipulate the items as I wish.Code
items_store.py
Hacky in-memory store. It is not very elegant but it got the job done for me. Modify and improve if you wish. I've implemented that as a simple class object so I can simply import it anywhere in the project and use it without passing its instance around.
pipelines.py
This pipleline will store the objects in the in-memory store from the snippet above. All items are simply returned to keep the regular pipeline flow intact. If you don't want to pass some items down the to the other pipelines simply change
process_item
to not return all items.settings.py
Now add the
StoreInMemoryPipeline
in the scraper settings. If you change theprocess_item
method above, make sure you set the proper priority here (changing the 100 down here).main.py
This is where I tie all these things together. I clean the in-memory store, run the crawler, and fetch all the items.