I'm looking at crawling thousands of pages and need a solution. Every site has its own HTML markup - they are all unique sites, and no clean data feed or API is available. I'm hoping to load the captured data into some sort of DB.
Any ideas on how to do this with scrapy if possible?
If I had to scrape clean data from thousands of sites, each with its own layout, structure, etc., I would implement (and actually have done so in some projects) the following approach:
This goes way beyond building a scrapy scraper, of course, and requires deep knowledge of and experience with NLP and possibly machine learning.
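Whatever the extraction logic ends up looking like, the crawling and storage side can still be a fairly ordinary scrapy project. Below is a minimal sketch of that part only, assuming you already have a list of start URLs and deferring the per-page extraction to a hypothetical `extract_fields()` helper; the spider name, the example URL, the SQLite schema, and the field names are all illustrative, not part of the approach described above.

```python
# Minimal crawl-and-store sketch. The real work (layout-agnostic field
# extraction via NLP/ML) would replace extract_fields() below.
import sqlite3

from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def extract_fields(response):
    """Hypothetical placeholder for the NLP/ML extraction step."""
    return {
        "url": response.url,
        "title": (response.css("title::text").get() or "").strip(),
    }


class GenericSpider(CrawlSpider):
    name = "generic"
    # Replace with your thousands of sites; in practice also set
    # allowed_domains and/or DEPTH_LIMIT so the crawl stays bounded.
    start_urls = ["https://example.com"]
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        yield extract_fields(response)


class SQLitePipeline:
    """Writes one row per scraped page into a local SQLite database."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("pages.db")
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute("INSERT INTO pages VALUES (?, ?)", (item["url"], item["title"]))
        return item


if __name__ == "__main__":
    process = CrawlerProcess(settings={"ITEM_PIPELINES": {"__main__.SQLitePipeline": 300}})
    process.crawl(GenericSpider)
    process.start()
```

Running the script crawls outward from the start URLs and stores a row per page in `pages.db`; you would swap `extract_fields()` for your real extraction logic and point the pipeline at whatever database you actually use.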
Also, you can't expect to get anywhere close to 100% accurate results from such an approach. Depending on how the algorithms are adjusted and trained, such a system will either skip some of the valid data (false negatives) or pick up data where there actually isn't any (false positives), or, more likely, a mix of both.
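To make that trade-off concrete, the usual way to measure it is precision and recall over a hand-labelled sample of pages. The tiny helper below is purely illustrative, with made-up numbers, and is not part of the approach itself:

```python
# Illustration only: given counts from a hand-labelled sample of pages,
# precision falls as false positives grow and recall falls as false
# negatives grow; tuning the extractor trades one against the other.
def precision_recall(true_pos, false_pos, false_neg):
    precision = true_pos / (true_pos + false_pos)  # share of extracted values that are real
    recall = true_pos / (true_pos + false_neg)     # share of real values that were extracted
    return precision, recall


# e.g. 80 correct extractions, 10 spurious ones, 20 missed ones
print(precision_recall(80, 10, 20))  # -> (0.888..., 0.8)
```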
Nonetheless, I hope my answer helps you get a good picture of what's involved.