27
Feb

Robots.txt for the Rest of Us

The robots.txt file.

Nothing can be more confusing to a website owner than the robots.txt file. Born out of technology in the programming world, the robots.txt file is nothing more than a plain text file of instructions for search engines. Unfortunately, while search engines understand the file, humans have a difficult time understanding machine language.

The Google blog is now running a two-part series on understanding robots.txt and the robots meta tag. Both of these articles, while providing a lot of great in-depth information, are much more than any site owner or manager wants to know, especially when you start talking technology, bots, spiders, permissions, etc. Most owners don’t know where to start, nor do they understand the technology behind either of these issues. What people really want to know is “what do I need to do?”

In fact, most website marketers don’t care. They just want it done.

Just tell me what to do
Most people just want to know what to do and where to put it, and then be done with it. So if that’s you, just go to the bottom of the article and you’ll get what you need. Otherwise, for those who are curious but don’t like technical explanations, I’m going to explain it the best that I can, in terms the common man, like me, can understand.

Robots.txt Explained. Sort of . . .
The best way to explain the robots.txt file is that it is a ‘welcome mat’ for the search engines. It’s not so much that the file is necessary for search engine success, but it’s one of those hundreds of small things that you need to consider, much like everything in SEO. If you have it, it will help your search engine success in a very small way. If you don’t have it, it won’t harm you; it’s simply a technical issue.

The technical issue is that the search engines request this file before or during every spidering session. Some request it before every session; some request it prior to groups of pages. Either way, search engines request this file multiple times in a session and in a day. If the file does not exist, then it shows up as a ‘page not found’ error in your log files. This is getting borderline technical, so I’ll stop here with this explanation. So, if the search engines request it, it must be important. That’s why I believe that it is important to have.

Welcome Home
I like to explain it as a ‘welcome mat’ because some people have a welcome mat at the entrance of their house and some people don’t. Either way, it doesn’t prevent people from coming into the house. The same goes for the robots.txt file: it simply tells search engines that they are welcome to visit the site.

Don’t Go There!
If you want to get fancy with your welcome mat, you can tell the search engine where not to go in your house. Typically, these are files that are not important to the search engines or files that you don’t want showing up in the search results. It’s kind of like that closet where you store all your junk. When people come over you don’t want them to go into the closet. It’s not vital for them to know what’s in there, as it’s stuff you typically store out of sight. For a website, some people “disallow” printer-friendly pages, images, or directories that they do not want to show up in the search results.

It’s Not for Security
Now, I am not saying to use this as a way of protecting information that you don’t want people to see. If that is the case, then you need to put that behind a password. The robots.txt file is not there to hide information from people. It is simply there to tell the search engines not to index it.

Knowing this is really what’s important from a marketing standpoint. The technical standpoint is a little more difficult, because it gets into server commands, which most people frankly don’t understand. I’m surprised how many times I run into problems with the robots.txt file as the culprit. This little file has been the cause of a lot of problems for some very large websites.

The Robots.txt Structure
There are only two lines required for a standard robots.txt file. The first line identifies the robots you want to specifically command.

User-agent: *

The asterisk is a wildcard, meaning: all robots – follow these instructions.

The second line tells the robots where not to go, which is defined either at the directory level or the page level.

Disallow:

If you don’t want to disallow anything, then don’t put anything there. That’s the typical set-up to allow the search engines free rein of your website.

It’s as simple as that. And here is what it looks like, written in a Notepad file:

User-agent: *
Disallow:
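For the curious, here is a minimal sketch of how a robot might read that two-line file. It uses `urllib.robotparser` from Python's standard library, which is only one possible reader; real search engines use their own software. The robot names and the example.com address are placeholders I made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line "welcome mat" file: every robot, nothing disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, robot_name: str, url: str) -> bool:
    """Return True if the rules in robots_txt let robot_name fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url)


if __name__ == "__main__":
    # Placeholder robot names and URL, used only for this illustration.
    for robot in ("Googlebot", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, robot, "https://www.example.com/any/page.html")
        print(robot, allowed)
```

Because of the asterisk, both robots get the same answer, and because the Disallow line is empty, that answer is True for any page on the site.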

Adding and Removing
Now, some people get a little fancy and like to disallow certain directories. This is usually done to remove any duplicate content. So, let’s say I have a directory of all of my printer-friendly pages, which are really only duplicates of the HTML pages.

User-agent: *
Disallow: /printerfriendly/

I’ve disallowed the entire directory by specifically naming it to the search engines.

The forward slash is an important part of this file. That slash is where most people make their mistakes.
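To see why the slash matters, here is a small sketch that checks a few made-up page addresses against the printer-friendly rule, with and without the trailing slash. It assumes the robot matches rules by comparing the start of each address, which is how Python's standard-library `urllib.robotparser` behaves; individual search engines may handle edge cases differently. The page names and example.com are hypothetical.

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # placeholder domain

WITH_SLASH = "User-agent: *\nDisallow: /printerfriendly/\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: /printerfriendly\n"


def is_allowed(rules: str, path: str, robot_name: str = "ExampleBot") -> bool:
    """Return True if the given rules let robot_name fetch SITE + path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(robot_name, SITE + path)


if __name__ == "__main__":
    # Hypothetical pages: one inside the directory, one ordinary page,
    # and one whose name merely begins with the same letters.
    pages = ["/printerfriendly/about.html", "/about.html", "/printerfriendly-tips.html"]
    for page in pages:
        print(page, is_allowed(WITH_SLASH, page), is_allowed(WITHOUT_SLASH, page))
```

With the trailing slash, only pages inside the directory are blocked. Without it, a page such as /printerfriendly-tips.html is blocked as well, because its address starts with the same characters.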

Blocking your Website
By adding a slash to the disallow command, like this:

Disallow:/

you are telling the search engines to “go away” from your entire website.
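If you are worried about making this mistake, a quick check before uploading the file can help. The sketch below is one hedged way to do it with Python's standard-library `urllib.robotparser`; the function name `blocks_whole_site` and the robot name are my own inventions, and the check only asks whether the home page is off limits to a generic robot.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(rules: str, robot_name: str = "ExampleBot") -> bool:
    """Return True if the rules stop robot_name from fetching the home page."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # If "/" itself is disallowed, every address on the site is disallowed too,
    # since every address starts with a slash.
    return not parser.can_fetch(robot_name, "/")


if __name__ == "__main__":
    candidates = {
        "lone slash": "User-agent: *\nDisallow:/\n",
        "empty disallow": "User-agent: *\nDisallow:\n",
        "one directory": "User-agent: *\nDisallow: /printerfriendly/\n",
    }
    for label, rules in candidates.items():
        print(label, blocks_whole_site(rules))
```

Only the lone-slash file reports True; the empty Disallow and the single-directory rule both leave the home page open.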

More info
If you want more information about the robots.txt file and all the things you can do with it, I suggest the following resources:

Robotstxt.org
Official Google Blog – The Robots Exclusion, pt 1
Google Blog – The Robots Exclusion, pt 2 (not published yet)
Robot.txt Code Generation Tool

Summary
Hopefully, this has helped a few understand the place and purpose of the robots.txt file. Even more than that, I hope that it has taken the fear away from dealing with this file. Many site managers are very gun-shy, as they may have blocked a site from the search engines with a misplaced slash at one time or another.

If you have any questions about this file, feel free to leave them in the comments. I and many others are very willing to help you understand what you need to know about the robots.txt file.

It’s better to ask questions and be sure that you are making the right move than to guess and disallow your entire website . . .

About Matt Bailey
Matt is the owner and founder of SiteLogic and has over 15 years in the internet marketing industry. He focuses on consulting and training to help companies take control of their websites and marketing strategies. You can find out more by reading his book: Internet Marketing: An Hour a Day

3 Comments for this entry

Gennady
February 28th, 2007 on 3:38 pm

Thanks Matt, for this great writeup. Very simple to understand.

Perhaps you can do a part 2 about implementation of the robots.txt file?

Matt Bailey
March 7th, 2007 on 12:50 pm

Sounds like a deal, Gennady. I’ll get started on that today.

Theatons Toys
May 28th, 2008 on 7:28 am

Considering your opening line is “Nothing can be more confusing to a website owner as the robots.txt file.” .. it sure turns out that actually robots.txt file is anything but confusing !

Thanks very much for the help. :)










