Question:
Over the last few days I have been coding a web crawler. The one question I have left is: do "standard" web crawlers crawl links with query strings, like https://www.google.se/?q=stackoverflow, or do they strip the query string and only pick up https://www.google.se instead?
Answer 1:
In case you are referring to crawling for some sort of indexing of web resources:
The full answer is long, but in short my opinion is this: if the "page/resource" https://www.google.se/?q=stackoverflow is pointed to by many other pages (i.e. it has a high in-link degree), then leaving it out of your index might mean missing a very important node in the web graph. On the other hand, imagine how many links of the type google.com/q="query" exist on the web. Probably a huge number, so crawling and indexing them all would certainly be a big overhead for your crawler/indexer system.
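One hedged way to act on that trade-off, purely as a sketch (the class name and threshold below are invented for illustration, not part of any standard crawler), is to only admit query-string URLs into the crawl frontier once enough distinct pages link to them:

```python
from collections import defaultdict

# Hypothetical frontier policy: query-string URLs are only admitted once
# enough distinct pages link to them. The threshold is illustrative.
class InLinkFrontier:
    def __init__(self, min_inlinks_for_query_urls=3):
        self.in_links = defaultdict(set)
        self.min_inlinks = min_inlinks_for_query_urls

    def record_link(self, source_url, target_url):
        # Called for every <a href> discovered while parsing source_url.
        self.in_links[target_url].add(source_url)

    def should_crawl(self, url):
        # Plain URLs are always eligible; URLs with a query string must
        # earn their place through in-link degree.
        if "?" not in url:
            return True
        return len(self.in_links[url]) >= self.min_inlinks
```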
Answer 2:
If the link is fetched with a GET request then yes, a web crawler should follow it.
There are still lots of websites that use the query string to identify which content is being requested, e.g. a blog article at /article.php?article_id=754. If web crawlers didn't follow links like these, a lot of content on the web would never get indexed.
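As a minimal sketch of that behaviour, assuming Python with the requests and BeautifulSoup libraries and an invented example blog URL, the crawler just issues a plain GET and keeps the query string when extracting links:

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_links(page_url):
    """Fetch a page with a plain GET and return the absolute URLs it links
    to, query strings included."""
    response = requests.get(page_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]

# A link such as /article.php?article_id=754 comes back as a full URL
# (e.g. https://example-blog.com/article.php?article_id=754) and can be
# queued for crawling like any other page.
links = extract_links("https://example-blog.com/")
```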
Answer 3:
In your particular example, many websites that offer search disallow crawling of their search results pages via /robots.txt.
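A polite crawler can check this before fetching. Here is a small sketch using Python's standard urllib.robotparser; the user-agent string is made up, and the exact result depends on what the site's robots.txt says when you run it:

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, then ask whether a search-results
# URL may be crawled. "MyCrawler/1.0" is an invented user agent.
rp = RobotFileParser("https://www.google.com/robots.txt")
rp.read()

allowed = rp.can_fetch("MyCrawler/1.0", "https://www.google.com/search?q=stackoverflow")
print(allowed)  # False as long as the site's robots.txt disallows /search
```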
You do need to crawl pages with cgi args, but a robust crawler also needs to understand which cgi args are irrelevant or harmful.
Crawling URLs with Urchin tracking cgi args (utm_campaign etc.) just means you're going to see duplicate content.
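One common way to avoid that, sketched here with Python's urllib.parse (the arg list is illustrative, and real crawlers keep much longer ones), is to canonicalize URLs by stripping known tracking args before deduplication:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of tracking args to drop; real crawlers accumulate
# far longer lists.
TRACKING_ARGS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def canonicalize(url):
    """Remove tracking query args so URLs that differ only by them
    collapse to the same canonical form."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in TRACKING_ARGS]
    return urlunsplit((scheme, netloc, path, urlencode(kept), fragment))

print(canonicalize("https://example.com/post?id=7&utm_campaign=spring"))
# -> https://example.com/post?id=7
```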
Sites that add a session cgi arg to every fetch not only have duplicate content, but some especially clever sites give an error if you show up with a stale cgi arg! This makes them nearly impossible to crawl.
Some sites have links with cgi args that are dangerous to access, e.g. "delete" buttons in a publicly editable database.
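A crude safeguard, again just a sketch (the patterns below are invented and would need tuning per site), is to refuse to queue URLs whose cgi args look like destructive actions:

```python
import re

# Illustrative patterns for query args that look like destructive actions.
DANGEROUS_QUERY = re.compile(r"[?&](?:action|do|op)=(?:delete|remove|drop)\b", re.IGNORECASE)

def safe_to_fetch(url):
    """Refuse to queue links that appear to trigger destructive actions."""
    return not DANGEROUS_QUERY.search(url)

print(safe_to_fetch("https://wiki.example.com/page.php?id=9&action=delete"))  # False
```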
Google Webmaster Tools has a way to tell Google which cgi args should be ignored for your site, but that's not helpful to other search engines. I don't know of anyone working on a robots.txt extension for this issue.
Over the past 4 years, blekko has accreted an awful regex of args which we delete out of URLs. It's a pretty long list!