I want to scrape a page of data (using the Python Scrapy library) without having to define each individual field on the page. Instead I want to dynamically generate fields using the id
of the element as the field name.
At first I was thinking the best way to do this would be to have a pipeline that collects all the data, and outputs it once it has it all.
Then I realised that I need to pass the data to the pipeline in an item, but I can't define an item as I don't know what fields it will need!
What's the best way for me to tackle this problem?
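The goal can be sketched without Scrapy at all: walk the page, and for every element that has an id, record an {id: text} pair. The stand-in below uses only the stdlib HTML parser; the IdTextCollector name and the sample HTML are invented for illustration, and in a real spider the same mapping would be built from Scrapy's selectors instead.

```python
# Minimal sketch of "field name = element id", using only the stdlib.
from html.parser import HTMLParser

class IdTextCollector(HTMLParser):
    """Collects {element_id: text} pairs from any elements carrying an id."""
    def __init__(self):
        super().__init__()
        self.current_id = None
        self.data = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if 'id' in attrs:
            self.current_id = attrs['id']

    def handle_data(self, text):
        # Attribute the first non-empty text run to the last id seen.
        if self.current_id and text.strip():
            self.data[self.current_id] = text.strip()
            self.current_id = None

parser = IdTextCollector()
parser.feed('<div id="price">9.99</div><span id="title">Widget</span>')
print(parser.data)  # {'price': '9.99', 'title': 'Widget'}
```

The field names come entirely from the page, which is exactly why a statically declared Item class is awkward here.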
This works with version 0.24 and also allows Items to work with Item Loaders:
This solution works with the exporters (scrapy crawl -t json -o output.json).

EDIT: updated to work with the latest Scrapy.
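The code block for this answer did not survive extraction. One pattern from that era is to build the item class itself at runtime with type(), from a list of field names discovered while scraping. Scrapy is not imported below: Field and DictItem are minimal stand-ins mimicking the relevant behavior (scrapy.Field really is a dict subclass, and a real item raises KeyError for undeclared fields), and create_item_class and PageItem are illustrative names.

```python
class Field(dict):
    """Stand-in for scrapy.Field, which is itself a dict subclass."""

class DictItem(dict):
    """Stand-in for a Scrapy item: only declared fields may be set."""
    fields = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{self.__class__.__name__} does not support field: {key}')
        super().__setitem__(key, value)

def create_item_class(class_name, field_names):
    """Build an item class whose fields are only known at runtime."""
    return type(class_name, (DictItem,), {'fields': {n: Field() for n in field_names}})

# Field names would come from the scraped page, e.g. element ids.
PageItem = create_item_class('PageItem', ['title', 'price'])
item = PageItem()
item['title'] = 'Widget'  # accepted: 'title' was declared dynamically
```

Because the generated class still declares its fields, it behaves like an ordinary item as far as exporters and item loaders are concerned.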
Update:
The old method didn't work with item loaders and was complicating things unnecessarily. Here's a better way of achieving a flexible item:
Result:
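The updated code and its result were lost in extraction. One flexible-item pattern that cooperates with item loaders is to make the item a plain dict while also inheriting from a marker base class so Scrapy still treats it as an item (in older Scrapy versions that marker was scrapy.item.BaseItem). The BaseItem below is a stand-in so the sketch runs without Scrapy installed, and the field name is invented.

```python
class BaseItem:
    """Stand-in for the marker base class (scrapy.item.BaseItem in old Scrapy)."""

class FlexibleItem(dict, BaseItem):
    """A dict with item-marker ancestry: any field name is accepted."""

item = FlexibleItem()
item['whatever-id'] = 'some value'  # no field declaration needed
print(dict(item))  # {'whatever-id': 'some value'}
```

Since the item is just a dict, there is no per-field validation to fight, which is what made the earlier __setitem__ override unnecessary.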
Old solution:
Okay, I've found a solution. It's a bit of a "hack", but it works.
A Scrapy Item stores the field names in a dict called fields. When adding data to an Item, it checks whether the field exists, and if it doesn't, it throws an error.

What you can do is override this __setitem__ method to be less strict. And there you go.
Now when you add data to an Item, if the Item doesn't have that field defined, the field is created on the fly and the data is then stored as normal.
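The override described above might look like the sketch below. Rather than importing Scrapy, a minimal stand-in Item reproduces the behavior the answer describes (declared fields kept in self.fields, values in self._values, KeyError for unknown fields); the FlexibleItem subclass then declares missing fields on the fly instead of raising. The class and field names are invented for illustration.

```python
class Field(dict):
    """Stand-in for scrapy.Field."""

class Item:
    """Stand-in for scrapy.Item's relevant mechanics."""
    fields = {}

    def __init__(self):
        self._values = {}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'Item does not support field: {key}')
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

class FlexibleItem(Item):
    fields = {}  # this class's own field registry

    def __setitem__(self, key, value):
        # Less strict: declare the field on the fly instead of raising.
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value

item = FlexibleItem()
item['element-id-42'] = 'scraped text'  # no KeyError, field added dynamically
```

Note that the dynamically added field lands in the class-level fields dict, so it becomes visible to every instance of FlexibleItem; that shared state is part of why this approach was later called a hack.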
I know that my answer is late, but for those who still need dynamic items in Scrapy (the current version is 1.x), I created a repository on GitHub including an example.
Here you go: https://github.com/WilliamKinaan/ScrapyDynamicItems