
Are Robots.txt and metadata tags enough to stop se

Posted 2019-01-20 15:16

Question:

I created a PHP page that is only accessible by means of a token/pass received through $_GET.

Therefore if you go to the following url you'll get a generic or blank page

http://fakepage11.com/secret_page.php

However if you used the link with the token it shows you special content

http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4

Of course this is not as safe as a login page, but my only concern is to create a dynamic page that is not indexable and only accessed through the provided link.

Are dynamic pages that depend on $_GET variables indexed by Google and other search engines?

If so, will including the following be enough to hide it?

  • Robots.txt User-agent: * Disallow: /

  • metadata: <META NAME="ROBOTS" CONTENT="NOINDEX">

Even if I type this into Google:

site:fakepage11.com/

Thank you!

Answer 1:

If a search engine bot finds the link with the token somehow¹, it may crawl and index it.

If you use robots.txt to disallow crawling the page, conforming search engine bots won’t crawl the page, but they may still index its URL (which then might appear in a site: search).

If you use meta-robots to disallow indexing the page, conforming search engine bots won’t index the page, but they may still crawl it.

You can’t have both: If you disallow crawling, conforming bots can never learn that you also disallow indexing, because they are not allowed to visit the page to see your meta-robots element.
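The interaction can be sketched with a simulated bot. This is only an illustration in Python using the standard library; `conforming_bot` and `MetaRobotsParser` are hypothetical names, and real search engine bots are of course more involved. The sketch assumes a bot that checks robots.txt first and only parses the HTML when crawling is allowed.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> element."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            self.directives.update(
                d.strip().lower() for d in attr.get("content", "").split(",")
            )


def conforming_bot(robots_txt, url, page_html):
    """Return (crawled, saw_noindex) for a hypothetical well-behaved bot."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch("*", url):
        # Crawling is disallowed, so the HTML and its meta element are never read.
        return (False, False)
    parser = MetaRobotsParser()
    parser.feed(page_html)
    return (True, "noindex" in parser.directives)


URL = "http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4"
PAGE = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body></body></html>'

# With "Disallow: /" the bot never fetches the page, so it cannot see NOINDEX.
blocked = conforming_bot("User-agent: *\nDisallow: /\n", URL, PAGE)
# With an empty Disallow the bot may crawl, and only then does it learn about NOINDEX.
allowed = conforming_bot("User-agent: *\nDisallow:\n", URL, PAGE)
```

Here `blocked` is `(False, False)` and `allowed` is `(True, True)`: the noindex instruction only reaches a bot that is permitted to crawl the page.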

¹ There are countless ways in which search engines might find a link. For example, a user who visits the page might use a browser toolbar that automatically sends all visited URLs to a search engine.



Answer 2:

If your page isn't discoverable then it will not be indexed.

By "discoverable" we mean:

  1. it is a standard web page, i.e. index.*
  2. it is referenced by another link either yours or from another site

So in your case, by using the GET parameter for access, you take care of 1 but not necessarily 2, since someone may link to that URL and thereby expose the "hidden" page.

You can use the robots.txt rule that you gave, in which case the page will not be crawled by bots that respect it (not all do). Of course, keeping your page out of the index doesn't mean that the "hidden" page URL won't be out in the wild.

Furthermore, another issue, depending on your requirements, is that you use unencrypted HTTP, which means that your "hidden" URLs and the content of the pages are visible to every intermediary between your server and the user.

Apart from search engines, be aware that certain services fetch and cache content when URLs are exchanged, for example in Skype or Facebook Messenger. In those cases they will visit the URL, try to extract metadata, and possibly cache it. Of course, this scenario does not expose your URL to the public, but it does expose the URL, and with it the content that you have "hidden", to the systems of those services.
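As a rough idea of what such a service might pull out of the page once it has the URL, here is a minimal Python sketch. The `PreviewParser` class, the `build_preview` function, and the sample HTML are all hypothetical; which fields a given messenger actually reads is an assumption, not something documented here.

```python
from html.parser import HTMLParser


class PreviewParser(HTMLParser):
    """Collects the <title> text and the <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.metadata["description"] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            # Title text may arrive in several chunks, so append.
            self.metadata["title"] = self.metadata.get("title", "") + data


def build_preview(page_html):
    """Return the metadata a hypothetical link-preview service might cache."""
    parser = PreviewParser()
    parser.feed(page_html)
    parser.close()
    return parser.metadata


# Hypothetical HTML served for the tokenized URL.
HIDDEN_PAGE = (
    "<html><head><title>Special content</title>"
    '<meta name="description" content="Only for token holders">'
    "</head><body>...</body></html>"
)
preview = build_preview(HIDDEN_PAGE)
```

Whatever ends up in `preview` now lives on the service's systems, regardless of robots.txt or meta-robots settings, if the service does not honor them.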

UPDATE: Another issue to consider is exposing a "hidden" page by linking from it to another page. In that case, your page's URL will appear as the referrer in the logs of the server that hosts the linked URL and will thus be visible; the same applies to Google Analytics etc. So if you want to stay hidden, do not link to other pages from the hidden page.
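To make the referrer leak concrete, here is a small Python sketch of what the operator of a linked site could recover from a `Referer` header value. The function name `token_from_referer` is hypothetical, and the sketch assumes the browser sends the full URL; many current browsers trim cross-origin referrers to the origin by default, so this is the worst case rather than the guaranteed one.

```python
from urllib.parse import parse_qs, urlsplit


def token_from_referer(referer):
    """Return the "token" query parameter of a Referer value, or None if absent."""
    query = parse_qs(urlsplit(referer).query)
    values = query.get("token")
    return values[0] if values else None


# Worst case: the full URL of the hidden page is sent as the referrer.
full_referer = "http://fakepage11.com/secret_page.php?token=344ee833bde0d8fa008de206606769e4"
# Trimmed case: only the origin is sent, so there is nothing to recover.
origin_only = "http://fakepage11.com/"

leaked = token_from_referer(full_referer)
not_leaked = token_from_referer(origin_only)
```

In the first case `leaked` holds the complete token, which is all anyone needs to open the hidden page; in the second, `not_leaked` is `None`.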