So basically I am trying to crawl a page with a set of categories, scrape each category's name, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:
- Category 1 name
    - Subcategory 1 name
        - data from this subcategory's page
    - Subcategory n name
        - data from this page
- Category n name
    - Subcategory 1 name
        - data from subcategory n's page
etc.
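In actual JSON I picture something nested like this (all the names and text here are just placeholders):

[
  {
    "category": "Category 1 name",
    "subcategories": [
      {
        "subcategory": "Subcategory 1 name",
        "data": "text from this subcategory's page"
      },
      {
        "subcategory": "Subcategory n name",
        "data": "text from this page"
      }
    ]
  }
]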
Eventually I want to be able to use this data with Elasticsearch.
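From the Elasticsearch docs I gather that its bulk API takes newline-delimited JSON, so I might eventually export flat items as JSON Lines with Scrapy's feed exports instead of one nested file (I believe the FEEDS setting needs Scrapy 2.1+), e.g. in settings.py:

# settings.py: write one JSON object per line (JSON Lines),
# which maps cleanly onto Elasticsearch bulk indexing
FEEDS = {
    "output.jl": {"format": "jsonlines", "encoding": "utf8"},
}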
I barely have any experience with Scrapy, and this is what I have so far (it just scrapes the category names from the first page; I have no idea where to go from here). From my research I believe I need to use a CrawlSpider, but I'm unsure of what that entails. I have also been advised to use BeautifulSoup. Any help would be greatly appreciated.
import scrapy


class randomSpider(scrapy.Spider):
    name = "helpme"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/categories']

    def parse(self, response):
        # Only grabs the first link text in each category section
        for i in response.css('div.CategoryTreeSection'):
            yield {
                'categories': i.css('a::text').extract_first()
            }
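From what I've read so far, I'm not sure I even need a CrawlSpider; it seems a plain Spider that chains callbacks with response.follow and carries the names along in cb_kwargs (Scrapy 1.7+) might do it. Here is a rough sketch of what I'm imagining. The div.SubcategoryList and div.content selectors, the parse_* method names, and the item fields are placeholders I made up, not real example.com markup:

import scrapy


class NestedSpider(scrapy.Spider):
    name = "nested"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/categories"]

    def parse(self, response):
        # Level 1: one link per category; remember the name via cb_kwargs.
        for link in response.css("div.CategoryTreeSection a"):
            yield response.follow(
                link,
                callback=self.parse_category,
                cb_kwargs={"category": link.css("::text").get()},
            )

    def parse_category(self, response, category):
        # Level 2: one link per subcategory; pass both names along.
        # "div.SubcategoryList" is a made-up selector.
        for link in response.css("div.SubcategoryList a"):
            yield response.follow(
                link,
                callback=self.parse_subcategory,
                cb_kwargs={
                    "category": category,
                    "subcategory": link.css("::text").get(),
                },
            )

    def parse_subcategory(self, response, category, subcategory):
        # Level 3: scrape the page text and emit one flat item.
        # "div.content" is also a made-up selector.
        yield {
            "category": category,
            "subcategory": subcategory,
            "data": " ".join(response.css("div.content ::text").getall()),
        }

If that's right, it would yield one flat item per subcategory page, and I could regroup the items by category into the nested JSON afterwards (or in an item pipeline), since Scrapy items seem easiest to keep flat. But I don't know if this is the idiomatic approach.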