Should a web-crawler pick up queries? [closed]

Posted 2019-08-17 18:17

Question:

Over the past few days I have been coding a web crawler. The only question I have left is: do "standard" web crawlers crawl links with query strings, like https://www.google.se/?q=stackoverflow, or do they skip the query and pick the link up as https://www.google.se?

Answer 1:

In case you are referring to crawling for some sort of indexing of web resources:

The full answer is long, but in short my opinion is this: if the page/resource https://www.google.se/?q=stackoverflow is pointed to by many other pages (i.e. it has a high in-link degree), then leaving it out of your index means you might miss a very important node in the web graph. On the other hand, imagine how many links of the form google.com/?q="query" exist on the web. Probably a huge number, so crawling them all would certainly be a huge overhead for your crawler/indexer system.
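A minimal sketch of that trade-off, assuming you track in-link counts while crawling; the threshold and the counts below are invented for illustration:

```python
from urllib.parse import urlparse

# Arbitrary cut-off for illustration; a real crawler would tune or learn this.
INLINK_THRESHOLD = 50

def should_index(url: str, inlink_count: int) -> bool:
    """Always index plain URLs; index query-string URLs only if many pages link to them."""
    has_query = bool(urlparse(url).query)
    return (not has_query) or inlink_count >= INLINK_THRESHOLD

print(should_index("https://www.google.se/?q=stackoverflow", inlink_count=3))    # False
print(should_index("https://www.google.se/?q=stackoverflow", inlink_count=120))  # True
```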



Answer 2:

If the link is visited using a GET request then yes, a web crawler should crawl it.

There are still lots of websites that use the query string to identify which content is being requested, e.g. a blog article at /article.php?article_id=754. If web crawlers didn't follow links like these, a lot of content on the web would never get indexed.
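As a small sketch of what "following links like these" means in practice, assuming Python's standard urllib and made-up example URLs: keep the query string when resolving a link, since it identifies the content, and drop only the fragment, which the server never sees.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical page URL and links extracted from its HTML.
base = "https://example.com/blog/"
hrefs = ["/article.php?article_id=754", "article.php?article_id=755#comments"]

for href in hrefs:
    parsed = urlparse(urljoin(base, href))
    # Keep the query string (it selects the article); drop only the fragment.
    print(parsed._replace(fragment="").geturl())
# https://example.com/article.php?article_id=754
# https://example.com/blog/article.php?article_id=755
```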



Answer 3:

In your particular example, many websites that offer search disallow crawling of their search results pages via /robots.txt.
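Honoring those rules is straightforward; here is a quick sketch using Python's standard urllib.robotparser, with an invented robots.txt that disallows a /search path:

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; many search sites disallow their results pages like this.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",
])

print(rp.can_fetch("MyCrawler", "https://example.com/search?q=stackoverflow"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/about"))                   # True
```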

You do need to crawl pages with CGI args (query-string parameters), but a robust crawler also needs to recognize args that are either irrelevant or harmful.

Crawling URLs that differ only in Urchin tracking args (utm_campaign etc.) just means you're going to see duplicate content.

Sites that add a session CGI arg to every fetch not only produce duplicate content, but some especially clever ones return an error if you show up with a stale session arg! That makes them nearly impossible to crawl.

Some sites have links with CGI args that are dangerous to access, e.g. "delete" buttons in a publicly editable database.

Google Webmaster Tools has a way to tell Google which CGI args should be ignored for your site, but that doesn't help other search engines. I don't know of anyone working on a robots.txt extension for this issue.

Over the past 4 years, blekko has accreted an awful regex of args which we delete out of URLs. It's a pretty long list!
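As a rough idea of what that kind of cleanup looks like (a minimal sketch with an invented, much shorter parameter list, not blekko's actual regex): strip the tracking and session args so duplicate URLs collapse to one canonical form.

```python
import re
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Invented stand-in for the kind of "args to delete" list described above.
IGNORED_ARGS = re.compile(r"^(utm_\w+|gclid|fbclid|sessionid|phpsessid)$", re.IGNORECASE)

def strip_ignored_args(url: str) -> str:
    """Remove tracking/session parameters so duplicate URLs collapse to one form."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not IGNORED_ARGS.match(k)]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(strip_ignored_args(
    "https://example.com/article.php?article_id=754&utm_campaign=spring&sessionid=abc123"))
# https://example.com/article.php?article_id=754
```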