Robots.txt versus .htaccess

The robots.txt file seems simple, but one small mistake can keep you from getting the results you want. The good news is that once you find and fix the problem, the search engines (SEs) generally pick up the change pretty quickly. The .htaccess file is similar, except that it’s more powerful and therefore more confusing to a lot of people. Like most webmasters, I only really use .htaccess for redirecting.

What’s important is understanding how these two files work together, and how they can work against each other if you make a mistake.

Some basics about robots.txt

As you probably already know, robots.txt tells SE robots which parts of your site they should visit and which they should ignore. For example, it’s a good idea to block the administrative pages of your blog that visitors can’t get into without a password. What would be the point of having those show up in the SERPs? You may also want to block pages that you don’t want to pass PageRank on to. (“Sculpting PageRank”, as this practice is called, is controversial. Generally, I don’t think it’s that valuable, but some people do.) You may also want to block a private set of pages that’s only viewable by subscribers.

There are a few pages you never ever want to block via robots.txt: your privacy page, contact page and, of course, your homepage.
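
To check rules like these before you upload them, here is a minimal sketch using Python’s standard urllib.robotparser module. The paths (“/wp-admin/”, “/members/”, “/contact/”, “/privacy/”) and the bot name “ExampleBot” are made up for illustration, so swap in your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the admin area and a subscribers-only
# section, and leave everything else open to crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /members/
"""


def check_paths(robots_txt, paths, agent="ExampleBot"):
    """Return {path: True/False} saying whether `agent` may fetch each path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


# Pages that should stay crawlable, plus the ones we meant to block.
results = check_paths(
    ROBOTS_TXT,
    ["/", "/contact/", "/privacy/", "/wp-admin/", "/members/archive"],
)

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

If the homepage, contact page or privacy page ever comes back as blocked, the robots.txt needs another look before it goes live.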

Blocking versus redirecting

Redirecting (usually with a 301, which marks the move as permanent) is done in .htaccess, and it sends a visitor’s browser from one URL to another. For example, I recently got rid of my old directory and decided to send its considerable number of visitors to my advertising page, so I set up a redirect in .htaccess:

redirect 301 /sites/ http://bluemushrooms.com/advertise

At the time I did this, I had “/sites/” blocked in my robots.txt file. A robot that obeys robots.txt never requests a blocked URL, so it never sees the 301 waiting there. This meant the search engines couldn’t pick up on the fact that the “sites” folder was gone and that they should now be indexing the “advertise” page instead. I had to remove this rule blocking “/sites/” from robots.txt:

User-agent: *
Disallow: /sites/

This is important to understand if you really want the SEs to drop a page or subfolder of your domain. If you want to block the SEs without redirecting visitors anywhere, just add a Disallow rule to robots.txt. If you want to redirect visitors without blocking SE bots, just set up a redirect in .htaccess.
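
To make the conflict concrete, here is a rough Python sketch of what a well-behaved robot does with “/sites/” before and after the robots.txt fix. It is a simulation under simple assumptions, not how any particular search engine works: the REDIRECTS table stands in for the .htaccess rule, only the exact “/sites/” URL is modelled, and the crawl function and the bot name “ExampleBot” are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Stand-in for the .htaccess rule:
#   redirect 301 /sites/ http://bluemushrooms.com/advertise
REDIRECTS = {"/sites/": "/advertise"}

# robots.txt as it was when the redirect went in.
ROBOTS_BLOCKING = """\
User-agent: *
Disallow: /sites/
"""

# robots.txt after removing the rule (an empty Disallow allows everything).
ROBOTS_FIXED = """\
User-agent: *
Disallow:
"""


def crawl(path, robots_txt, agent="ExampleBot"):
    """Describe what a robots.txt-respecting robot learns about `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, path):
        # The robot never sends the request, so the 301 stays invisible.
        return "blocked: redirect never seen"
    if path in REDIRECTS:
        return "301 -> " + REDIRECTS[path]
    return "200"


print(crawl("/sites/", ROBOTS_BLOCKING))
print(crawl("/sites/", ROBOTS_FIXED))
```

With the Disallow rule in place the robot stops before it ever reaches the redirect. Once the rule is gone, it gets the 301 and can follow it to the new page.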
