robots.txt :: friend or foe. Use this tool with care. It can go wrong.

2009-02-21 By midboh admin

The pitfalls and benefits of using a robots.txt file

The robot is coming

If you check your web server log files, you might see entries referring to a file called “robots.txt”. If you don’t have one, your log file will show a 404 error (page not found) for those requests; if you do have one, it will simply be listed as one of the files accessed by certain visitors to your site.

Whether you have one or not, you might be wondering what it is and whether it’s worthwhile. While it’s not a “must have”, for most sites it is a useful addition.

What’s the purpose of a robots.txt file?

Most of us are aware of the search engine “robots” that hopefully trawl our sites on a regular basis, looking for new content to feast on. The robots.txt file is one way to give these robots some clues about where they should look within your site. If you want some specific information about these robots, it may be useful to visit The Web Robots Pages.

It may seem a little back to front, but this device is based on an “exclusion” protocol, i.e. you tell the robots what they can’t touch.
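To make that concrete, here is a minimal sketch of the format. The file simply lives in the root of your site (e.g. http://www.example.com/robots.txt); the /private/ directory here is just a hypothetical example:

  # Rules for all robots ("*" matches any user agent)
  User-agent: *
  # Keep robots out of this one directory; everything else stays open
  Disallow: /private/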

What are the benefits of using a robots.txt file?

The robots.txt file is essentially a set of instructions telling the robots which areas of a site should be avoided. Listing these parts of your website in a robots.txt file can reduce the time robots spend at your site and, in the process, reduce the bandwidth they consume.

In addition, you will keep certain parts of your site out of search engine indexes. This is not a strong form of security, but it can keep work in progress or temporary content from being included in search results.
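A practical file might combine both ideas. The sketch below assumes some typical directories; the names are hypothetical examples, not part of any standard:

  User-agent: *
  Disallow: /cgi-bin/   # scripts add nothing to a search index
  Disallow: /images/    # large files that mainly consume bandwidth
  Disallow: /drafts/    # work in progress, not ready for search results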

Are there problems using a robots.txt file?

Unfortunately yes. Here are the ones we have identified so far:

  • Not all search engine robots follow the protocol.
  • It is not a security system.
  • If poorly implemented, it can prevent the whole site from being included in the search engine indexes. *
  • By default, all pages and directories are included (because this is an exclusion protocol, i.e. you specify the files and directories that are to be skipped by the robot).

* One site we were asked to help with couldn’t work out why it could not get listed. They had used Google’s sitemap facility to list their pages, had good internal navigation, and robots were visiting the site, yet no pages were being indexed. We discovered the site had a corrupt robots.txt file that effectively excluded everything from the robots. Not a great way to generate search engine traffic.
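The difference between a harmless file and that disaster can be a single character. A bare slash excludes the entire site, while an empty Disallow value excludes nothing at all:

  # Excludes the WHOLE site from every robot that follows the protocol
  User-agent: *
  Disallow: /

  # Excludes nothing -- all robots may crawl everything
  User-agent: *
  Disallow: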

Using a robots.txt file is worthwhile, but make sure it is set up correctly so it does exactly what you want, and remember to update it whenever you add new sections of your website that you wish to exclude from the search engine indexes.

If you want any help understanding the use of robots.txt, please contact us and we’ll try to answer any questions you have.
