Exclude crawlers server-wide with X-Robots-Tag

For a staging site, it's important to exclude crawlers: you wouldn't want your content to get indexed at the wrong URL! The conventional wisdom is to use HTTP Basic authentication. That approach has some disadvantages, however, and I've found I prefer using the X-Robots-Tag HTTP header instead. Note that this assumes your only objective is to prevent indexing by well-behaved crawlers; if you do need to keep secrets, this method is obviously unsuitable.
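
In case it's unfamiliar, X-Robots-Tag is just a response header that carries the same directives as the robots <meta> tag. With the header set, a response looks something like this (the directives shown here match the Apache example below):

HTTP/1.1 200 OK
Content-Type: text/html
X-Robots-Tag: noindex, noarchive, nosnippet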

Disadvantages of HTTP Basic

  • It confuses users: you can't log out, and you can't even tell whether a request is authenticated.
  • It prevents testing third-party apps that request resources from your server (sure, you could whitelist them, but see the next point).
  • It's just plain annoying.

Advantages of X-Robots-Tag

  • You don't have to modify or redirect robots.txt (useful if your application controls robots.txt, and you want to retain the ability to test on staging).
  • Unlike <meta name="robots" ...>, it works for all file types, not just HTML.
  • It's easy to add the header in the global Apache httpd.conf (with mod_headers enabled), for example:
<Directory />
  # Globally disallow robots from the development server
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</Directory>
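
To confirm the header is actually being served, a quick check with curl does the trick (staging.example.com is a stand-in for your own staging hostname):

curl -sI http://staging.example.com/ | grep -i x-robots-tag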

Comments

Informative; however, I do not want to modify my httpd.conf file. I suppose it works the same in .htaccess as well?
