Sites are crawled even when the URL is removed fro

Posted 2019-09-08 18:00

Question:

I ran a successful crawl with url-1 in seed.txt and could see the crawled data in the MySQL database. I then tried a fresh crawl after replacing url-1 with url-2 in seed.txt. The new crawl started at the fetch step, and the URLs it tried to fetch belonged to the old URL that I had replaced in seed.txt. I am not sure where it picked up the old URL from.

I checked for hidden seed files and found none. The only seed file is urls/seed.txt in NUTCH_HOME/runtime/local, where I run my crawl command. What might be the issue?

Answer 1:

Your crawl database holds the list of URLs to crawl, and that list persists between runs. Unless you delete the original crawl directory or point the new crawl at a new one, the existing list of URLs is reused and simply extended with the new URL from seed.txt.
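
As a rough illustration of the "delete the original crawl directory" option, here is a minimal Python sketch that removes a previous crawl directory before a fresh crawl is started. The function name `reset_crawl_dir` and the directory name `crawl` are hypothetical and not taken from the question. It also assumes the crawl state lives in a directory on disk; if the crawl state is kept in MySQL, as the question suggests, the stored URLs would have to be cleared there instead, which this sketch does not do.

```python
import shutil
import tempfile
from pathlib import Path


def reset_crawl_dir(crawl_dir: str) -> bool:
    """Delete a previous crawl directory so the next crawl starts empty.

    Returns True if a directory was removed, False if nothing was there.
    `crawl_dir` is a hypothetical path; use whatever directory your
    crawl command actually writes to.
    """
    path = Path(crawl_dir)
    if not path.exists():
        return False
    # Remove the directory and everything under it (old URL list included).
    shutil.rmtree(path)
    return True


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real crawl.
    base = Path(tempfile.mkdtemp())
    old_crawl = base / "crawl"
    (old_crawl / "crawldb").mkdir(parents=True)

    print(reset_crawl_dir(str(old_crawl)))  # a directory existed and was removed
    print(old_crawl.exists())               # the old crawl state is gone
```

The alternative mentioned in the answer, giving each crawl its own new directory, avoids deleting anything and keeps the earlier results available.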