Different robots.txt for staging server on Heroku

Published 2019-02-13 17:36

Question:

I have staging and production apps on Heroku.

For crawlers, I set up a robots.txt file.

After that, I got this message from Google:

Dear Webmaster, The host name of your site, https://www.myapp.com/, does not match any of the "Subject Names" in your SSL certificate, which were:
*.herokuapp.com
herokuapp.com

The Google bot read the robots.txt on my staging app and sent this message, because I hadn't set anything to prevent crawlers from reading it.

So what I'm thinking about is varying the .gitignore file between staging and production, but I can't figure out how to do this.

What are the best practices for implementing this?

EDIT

I googled this and found this article: http://goo.gl/2ZHal

This article says to set up basic Rack authentication, and then you won't need to worry about robots.txt at all.

I didn't know that basic auth could block the Google bot. This solution seems better than manipulating the .gitignore file.
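
For reference, here is a minimal sketch of that approach: putting the whole app behind Rack's built-in Rack::Auth::Basic middleware, toggled by environment variables (the variable names here are illustrative, not from the article). Since a Heroku staging app often runs with RAILS_ENV=production, an env-var toggle is safer than checking Rails.env:

# config/application.rb -- inside the Application class.
# Enable basic auth only when BASIC_AUTH_USER is set (e.g. on the staging app).
if ENV['BASIC_AUTH_USER']
  config.middleware.use Rack::Auth::Basic, 'Staging' do |user, password|
    user == ENV['BASIC_AUTH_USER'] && password == ENV['BASIC_AUTH_PASSWORD']
  end
end

A crawler that gets a 401 for every page, including robots.txt, has nothing to index.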

Answer 1:

What about serving /robots.txt dynamically using a controller action instead of having a static file? Depending on the environment, you allow or disallow search engines to index your application.
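
For example, a minimal sketch of that idea (the controller and action names are illustrative; remember to delete public/robots.txt so the static file doesn't shadow the route):

# config/routes.rb
match '/robots.txt' => 'robots#show'

# app/controllers/robots_controller.rb
class RobotsController < ApplicationController
  # Serve the real robots.txt in production; disallow everything elsewhere.
  def show
    body = if Rails.env.production?
      File.read(Rails.root.join('config', 'robots.txt'))
    else
      "User-agent: *\nDisallow: /"
    end
    # :content_type is needed because render :text defaults to text/html.
    render :text => body, :content_type => 'text/plain'
  end
end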



Answer 2:

A great solution with Rails 3 is to use Rack. Here is a post that outlines the process: Serving Different Robots.txt Using Rack. To summarize, you add this to your routes.rb:

 # config/routes.rb
 require 'robots_generator' # Rails 3 does not autoload files in lib 
 match "/robots.txt" => RobotsGenerator

and then create a new file at lib/robots_generator.rb:

# lib/robots_generator.rb
class RobotsGenerator
  # Use the config/robots.txt in production.
  # Disallow everything for all other environments.
  # http://avandamiri.com/2011/10/11/serving-different-robots-using-rack.html
  def self.call(env)
    body = if Rails.env.production?
      File.read Rails.root.join('config', 'robots.txt')
    else
      "User-agent: *\nDisallow: /"
    end

    # Heroku can cache content for free using Varnish.
    headers = {
      'Content-Type'  => 'text/plain',
      'Cache-Control' => "public, max-age=#{1.month.to_i}"
    }

    [200, headers, [body]]
  rescue Errno::ENOENT
    # Rack responses should carry a Content-Type even on errors.
    [404, { 'Content-Type' => 'text/plain' }, ['# A robots.txt is not configured']]
  end
end

Finally, make sure to move robots.txt out of public/ and into your config folder (or wherever you point to in your RobotsGenerator class); otherwise the static file in public/ will be served before your route is ever reached.
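
To sanity-check the setup, you can request the file in development and on each Heroku app; outside production the generator above should return:

$ curl http://localhost:3000/robots.txt
User-agent: *
Disallow: /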