Sites are crawled even when the URL is removed fro

2019-09-08 18:12发布

I performed a successful crawl with url-1 in seed.txt and I could see the crawled data in MySQL database. Now when I tried to perform another fresh crawl by replacing url-1 with url-2 in seed.txt, the new crawl started with fetching step and the urls it was trying to fetch is of the old replaced url in seed.txt. I am not sure from where it picked up the old url.

I tried to check for hidden seed files, I didn't find any and there is only one folder urls/seed.txt in NUTCH_HOME/runtime/local where I run my crawl command. Please advise what might be the issue?

1条回答
贪生不怕死
2楼-- · 2019-09-08 18:33

Your crawl database contains a list of URLs to crawl. Unless you delete the original crawl directory or create a new one as part of your new crawl, the original list of URLs will be used and extended with the new URL.

查看更多
登录 后发表回答