I have a website that has many pages like this:
mywebsite/?page=1
mywebsite/?page=2
...
...
...
mywebsite/?page=n
Each page has links to players. When you click on any link, you go to that player's page.
Users can add players, so I will end up with this situation:
Player1 has a link in page=1.
Player10 has a link in page=2.
After an hour, because users have added new players, I will have this situation:
Player1 has a link in page=3.
Player10 has a link in page=4.
The new players, like Player100 and Player101, have links in page=1.
I want to scrape all players to get their information. However, I don't want to scrape players that I have already scraped. My question is how to use the BaseDupeFilter in Scrapy to tell which players have already been scraped and which have not. Remember, I still need to crawl the listing pages of the website on every run, because each page will list different players each time.
Thank you.
I'd take another approach: rather than querying for the last player during the spider run, launch the spider with a precalculated argument that identifies the last scraped player:
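Scrapy passes spider arguments on the command line with `-a name=value`. A minimal sketch of building such a launch command follows; the spider name `players` and the argument name `last_player` are assumptions made for illustration, and "last scraped player" is taken to mean the newest player seen in the previous run:

```python
import shlex
from typing import List


def build_crawl_command(spider_name: str, last_player: str) -> List[str]:
    """Build the argv for `scrapy crawl`, passing the last scraped player.

    `last_player` is a hypothetical argument name; the spider would read it
    as a spider argument supplied via Scrapy's `-a name=value` option.
    """
    return ["scrapy", "crawl", spider_name, "-a", f"last_player={last_player}"]


if __name__ == "__main__":
    # "players" and "Player101" are placeholder values.
    command = build_crawl_command("players", "Player101")
    # Print the shell-quoted command; it could also be handed to subprocess.run.
    print(shlex.join(command))
```

The value itself would come from wherever the previous run's results are stored, for example the most recent player in your database.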
Then your spider may look like:
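A minimal sketch of the per-page decision such a spider could make, written as plain Python so it runs without Scrapy. It assumes that listing pages show the newest players first (as in the question, where new players appear on page=1) and that a player is identified by its link; the function names are hypothetical:

```python
from typing import Iterable, List, Tuple


def select_new_players(page_links: Iterable[str], last_player: str) -> Tuple[List[str], bool]:
    """Return the links that come before `last_player` on one listing page.

    The boolean is True when `last_player` was found, meaning everything
    after it (and on later pages) was already scraped in a previous run.
    """
    new_links: List[str] = []
    for link in page_links:
        if link == last_player:
            return new_links, True
        new_links.append(link)
    return new_links, False


def collect_new_players(pages: Iterable[Iterable[str]], last_player: str) -> List[str]:
    """Walk listing pages in order (page=1, page=2, ...) and stop at `last_player`."""
    collected: List[str] = []
    for links in pages:
        new_links, reached_last = select_new_players(links, last_player)
        collected.extend(new_links)
        if reached_last:
            break  # older pages contain only already-scraped players
    return collected


if __name__ == "__main__":
    # Placeholder data: two listing pages, newest players first.
    pages = [["Player101", "Player100", "Player10"], ["Player9", "Player1"]]
    print(collect_new_players(pages, "Player10"))
```

In a Scrapy `parse` callback, the same check would decide which player links to yield requests for, and the next `?page=` request would only be issued while the last scraped player has not been reached.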