Exclude crawlers server-wide with X-Robots-Tag
For a staging site, it's important to exclude crawlers. You wouldn't want your content to get indexed at the wrong URL! The conventional wisdom is to use HTTP Basic authentication, but that approach has some disadvantages, and I've found I prefer setting the X-Robots-Tag HTTP header instead. Note that this assumes your only objective is to prevent indexing by well-behaved crawlers. If you need to keep secrets, this method is obviously unsuitable.
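To make this concrete, here's what a response carrying the header might look like (the host name is hypothetical):

    HTTP/1.1 200 OK
    Content-Type: text/html
    X-Robots-Tag: noindex, noarchive, nosnippet

You can check any URL on your staging host with curl -I to confirm the header is present.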
Disadvantages of HTTP Basic
- It confuses users: you can't log out, and you can't even tell whether a request is authenticated.
- It prevents testing third-party apps that request resources from your server (sure you could whitelist them, but see the next point).
- It's just plain annoying.
Advantages of X-Robots-Tag
- You don't have to modify or redirect robots.txt (useful if your application controls robots.txt, and you want to retain the ability to test on staging).
- Unlike <meta name="robots" ...>, it works for all file types, not just HTML.
- It's easy to add the header in the global Apache httpd.conf.
    <Directory />
        # Globally disallow robots from the development server
        # (requires mod_headers)
        Header set X-Robots-Tag "noindex, noarchive, nosnippet"
    </Directory>
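If your staging server runs nginx instead of Apache, the equivalent is a one-liner (a sketch; place it in the server block for your staging host):

    # Globally disallow robots from the staging server
    add_header X-Robots-Tag "noindex, noarchive, nosnippet" always;

The always parameter makes nginx add the header to error responses as well, not just 2xx and 3xx responses.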