I want crawler4j to visit only pages that belong to the domains in the seed list. There are multiple domains in the seed list. How can I do this?
Suppose I add these seed URLs:
- www.google.com
- www.yahoo.com
- www.wikipedia.com
Now I start crawling, but I want my crawler to visit pages (decided in something like shouldVisit()) only in the above three domains. Obviously there will be external links, but I want my crawler to stay within these domains. Subdomains and subfolders are okay, but nothing outside these domains.
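
To make the requirement concrete, here is a rough sketch of the kind of check I have in mind, written in plain Java without any crawler4j classes. The class name SeedDomainFilter, the method isInSeedDomains, and the hard-coded domain list are all made up for illustration; the list just mirrors the seeds above with the "www." prefix dropped so that subdomains match too.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

class SeedDomainFilter {
    // Hypothetical list mirroring the seed URLs above, without "www."
    private static final List<String> SEED_DOMAINS =
            List.of("google.com", "yahoo.com", "wikipedia.com");

    // Returns true if the URL's host is one of the seed domains
    // or a subdomain of one of them.
    static boolean isInSeedDomains(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        if (host == null) {
            return false; // e.g. no scheme, so no host could be extracted
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : SEED_DOMAINS) {
            // Exact match, or a subdomain such as mail.google.com.
            // The leading dot keeps notgoogle.com from matching google.com.
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }
}
```

Presumably something like this would be called from my crawler's shouldVisit() with the candidate URL as a string, but I am not sure whether that is the right place or whether crawler4j already offers a built-in way to do it.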