Find all the web pages in a domain and its subdomains

Posted 2019-05-11 13:44

I am looking for a way to find all the web pages and subdomains in a domain. For example, for uoregon.edu, I would like to find all the web pages in that domain and in all of its subdomains (e.g., cs.uoregon.edu).

I have been looking at Nutch, and I think it can do the job. However, Nutch seems to download entire web pages and index them for later search, whereas I want a crawler that only scans each page for URLs that belong to the same domain. Nutch also seems to save its linkdb in a serialized format. How can I read it? I tried Solr, and it can read the data Nutch collects, but I don't think I need Solr since I am not performing any searches. All I need are the URLs that belong to a given domain.

Thanks

2 answers
爱情/是我丢掉的垃圾
#2 · 2019-05-11 13:58

If you're familiar with Ruby, consider using Anemone, a wonderful crawling framework. Here is sample code that collects the URL of every page it visits.

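require 'anemone'

# Placeholder start URL (an assumption); replace it with the site to crawl.
site_url = 'http://www.example.com/'

urls = []

# Crawl from site_url and record the URL of every page visited.
Anemone.crawl(site_url) do |anemone|
  anemone.on_every_page do |page|
    urls << page.url
  end
end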

https://github.com/chriskite/anemone

Disclaimer: To crawl subdomains you need to apply a patch from the project's issues, and you might want to consider adding a maximum page count.
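
The two caveats in the disclaimer (subdomain handling and a page limit) can be illustrated independently of any crawler. The following is a minimal sketch using only Ruby's standard library; `in_domain?`, `collect_in_domain`, and the `max_pages` parameter are hypothetical names for illustration and are not part of Anemone.

```ruby
require 'uri'

# Returns true when the URL's host is the given domain or one of its subdomains.
# (Hypothetical helper, not part of Anemone.)
def in_domain?(url, domain)
  host = URI.parse(url.to_s).host
  return false if host.nil?          # relative links, mailto:, etc.
  host = host.downcase
  domain = domain.downcase
  # The leading dot prevents "notuoregon.edu" from matching "uoregon.edu".
  host == domain || host.end_with?(".#{domain}")
rescue URI::InvalidURIError
  false
end

# Keeps unique in-domain URLs from a list of candidates, stopping at max_pages.
def collect_in_domain(candidates, domain, max_pages: 1000)
  urls = []
  candidates.each do |candidate|
    break if urls.size >= max_pages
    next unless in_domain?(candidate, domain)
    urls << candidate unless urls.include?(candidate)
  end
  urls
end
```

A predicate like this could be applied to the links a crawler discovers before they are queued, so that only pages under the target domain are ever fetched.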

来,给爷笑一个
#3 · 2019-05-11 14:04

The easiest way to find all subdomains of a given domain is to ask the DNS administrators of the site in question to provide you with a DNS Zone Transfer or their zone files; if there are any wildcard DNS entries in the zone, you'll have to also get the configurations (and potentially code) of the servers that respond to requests on the wildcard DNS entries. Don't forget that portions of the domain name space might be handled by other DNS servers -- you'll have to get data from them all.
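
If the administrators do hand over zone files, the host names can be pulled out mechanically. Below is a rough sketch that assumes a heavily simplified zone-file layout: one record per line, numeric TTLs only, no multi-line parenthesized records. The names `hostnames_from_zone`, `qualify`, and the choice of record types are assumptions made for illustration; a real zone file would call for a proper parser.

```ruby
# Record types treated as naming a host (a simplifying assumption).
HOST_RECORD_TYPES = %w[A AAAA CNAME].freeze
DNS_CLASSES = %w[IN CH HS].freeze

# Turns a zone-file owner name into a fully qualified name without trailing dot.
def qualify(owner, origin)
  return origin if owner == '@'                    # "@" means the zone origin
  return owner.chomp('.') if owner.end_with?('.')  # already absolute
  "#{owner}.#{origin}"
end

# Returns the host names and wildcard names found in simplified zone-file text.
def hostnames_from_zone(zone_text, origin)
  hosts = []
  wildcards = []
  zone_text.each_line do |raw|
    line = raw.sub(/;.*/, '').rstrip               # drop comments
    next if line.empty? || line.start_with?('$')   # skip blanks and directives
    next if line.match?(/\A\s/)                    # owner inherited; already seen
    owner, *rest = line.split
    # Skip the optional TTL and class fields to reach the record type.
    rest = rest.drop_while { |f| f.match?(/\A\d+\z/) || DNS_CLASSES.include?(f.upcase) }
    next unless HOST_RECORD_TYPES.include?(rest.first.to_s.upcase)
    name = qualify(owner, origin)
    (name.start_with?('*.') ? wildcards : hosts) << name
  end
  { hosts: hosts.uniq, wildcards: wildcards.uniq }
end
```

Any names reported under `:wildcards` mark the places where, as noted above, the zone data alone cannot tell you which host names actually serve content.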

This is especially complicated because HTTP servers might handle requests to different names differently: the handling may be baked into their server configuration files or into the application code running on the servers, or that application code may perform database lookups to determine what to do with a given name. FTP does not provide for name-based virtual hosting, and whatever other services you're interested in may or may not provide name-based virtual hosting protocols.
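
To make the HTTP case concrete: with name-based virtual hosting, requests for different names can arrive at the same server address and differ only in the Host header. The sketch below just builds the raw request text so the difference is visible; `http_request_for` is a hypothetical helper, and nothing is sent over the network.

```ruby
# Builds the raw text of a minimal HTTP/1.1 GET request for the given host name.
# (Hypothetical helper for illustration; it does not open a connection.)
def http_request_for(host, path = '/')
  "GET #{path} HTTP/1.1\r\n" \
    "Host: #{host}\r\n" \
    "Connection: close\r\n" \
    "\r\n"
end

# Two requests that could be sent to the same server address; only the
# Host header tells the server which site is being asked for.
puts http_request_for('uoregon.edu')
puts http_request_for('cs.uoregon.edu')
```

How the server maps each Host value to content is decided by its configuration or application code, which is why the set of names it answers to cannot be discovered from the outside with certainty.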
