An Intro to robots.txt
Intro
The robots.txt protocol, also known as the Robots Exclusion Standard or Robots Exclusion Protocol, defines which parts of your site web spiders and crawlers (search indexing tools) may access.
The web administrator can allow or disallow access to certain parts of the site, e.g. temporary, private, or cgi directories. robots.txt is a plain text file that holds the rules telling these spiders and crawlers where they may and may not go. You can also block a folder, e.g. /images, if you consider crawling it both meaningless and a waste of your site’s bandwidth.
Syntax of robots.txt
# Comments start with the pound sign
User-agent: [crawl agent name or * for all]
There are currently three statements you can use in robots.txt:
- Disallow: /path
- Allow: /path
- Sitemap: http://example.com/sitemap.xml
Note:
- Crawl-delay: [a number that represents the time in seconds] is not part of the standard, but some bots do obey it.
- Each crawler user agent section must be separated by a blank line.
- Not all search engines support pattern matching. E.g. to block any directory whose name begins with picture: Disallow: /picture*/
- If you use multiple User-agent sections, it is recommended that the block addressed to all spiders (User-agent: *) comes last. This reduces interpretation problems with some old robots.
- Paths are case-sensitive. E.g. junk.html and Junk.html are seen as two different files.
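As a rough illustration of the syntax and notes above, here is a small Python sketch that feeds a robots.txt to the standard library's urllib.robotparser. The file contents, the bot name ExampleBot, and the example.com URLs are invented for this sketch, and crawl_delay() assumes Python 3.6 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
ROBOTS_TXT = """\
# Comments start with the pound sign
User-agent: *
Crawl-delay: 10
Disallow: /junk.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Paths are compared case-sensitively: only the lowercase file is blocked.
junk_lower = parser.can_fetch("ExampleBot", "http://example.com/junk.html")
junk_upper = parser.can_fetch("ExampleBot", "http://example.com/Junk.html")

# Everything under /private/ is blocked for all bots.
private_page = parser.can_fetch("ExampleBot", "http://example.com/private/a.html")

# Crawl-delay is exposed by this parser, but a real bot is free to ignore it.
delay = parser.crawl_delay("ExampleBot")

print(junk_lower, junk_upper, private_page, delay)
```

This only shows how one particular parser reads the file. Real crawlers may interpret the same lines differently.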
User-agents
User-Agents.Org has a very good list of these bots.
On the Web Robots Pages you can see a list of common robots and their User-agent names.
Blocking
Blocking the entire site:
Disallow: /
Blocking a specific directory, images, and its contents:
Disallow: /images/
Blocking a specific file:
Disallow: /junk.html
Blocking a specific file type:
Disallow: /*.xls$
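The blocking rules above can be sketched in Python to show how a crawler that supports pattern matching might evaluate them. This is a minimal sketch under an assumption: that * matches any run of characters, a trailing $ anchors the end of the path, and everything else is a plain prefix match. The helper names pattern_to_regex and path_matches are made up for this example, and not every crawler behaves this way.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any characters, a trailing '$'
    anchors the end of the path, everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.compile(regex)


def path_matches(pattern, path):
    """Return True if the rule pattern applies to the given URL path."""
    # re.match anchors at the start, which gives the prefix behaviour.
    return pattern_to_regex(pattern).match(path) is not None


print(path_matches("/images/", "/images/logo.png"))    # directory rule
print(path_matches("/junk.html", "/junk.html"))        # single file rule
print(path_matches("/*.xls$", "/reports/q1.xls"))      # file type rule
print(path_matches("/*.xls$", "/reports/q1.xlsx"))     # different extension
```

Under these assumptions the first three calls match and the last one does not, because the $ requires the path to end in .xls.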
Examples:
# Disallowing all robots
User-agent: *
Disallow: /
# Blocking only one robot, e.g. Altavista’s Scooter, and allowing all others
User-agent: Scooter
Disallow: /

User-agent: *
Disallow:
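To check that a file like this behaves as intended, one option is Python's standard urllib.robotparser. The sketch below parses the Scooter example from a string; the second bot name, OtherBot, and the example.com URL are placeholders invented for the test.

```python
from urllib.robotparser import RobotFileParser

# Block Scooter from everything, allow every other robot everywhere.
# Note the blank line separating the two User-agent sections.
SCOOTER_RULES = """\
User-agent: Scooter
Disallow: /

User-agent: *
Disallow:
"""

scooter_parser = RobotFileParser()
scooter_parser.parse(SCOOTER_RULES.splitlines())

url = "http://example.com/index.html"  # placeholder URL
scooter_allowed = scooter_parser.can_fetch("Scooter", url)
other_allowed = scooter_parser.can_fetch("OtherBot", url)  # hypothetical bot

print("Scooter:", scooter_allowed)
print("OtherBot:", other_allowed)
```

An empty Disallow: value means nothing is disallowed, which is why the second section lets all other robots in.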
Now this is something we learned through trial and error. The following example looks as if it blocks MSNBot from the junk directory, blocks all bots (including MSNBot) from cgi-bin and private, and gives every bot except MSNBot access to junk.
THIS DOES NOT WORK LIKE THAT!
User-agent: MSNBot
Disallow: /junk/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
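A bot that finds a section addressed to it follows only that section and ignores the User-agent: * section. So the directories you disallow for all bots (User-agent: *) must also be disallowed explicitly for MSNBot. It should be: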
User-agent: MSNBot
Disallow: /junk/
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
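The difference between the two MSNBot files can be reproduced with Python's standard urllib.robotparser, which, like the bots described here, applies only the section that names the robot. This is a sketch: the example.com URLs and the helper name allowed are invented for the comparison, and other parsers may not behave identically.

```python
from urllib.robotparser import RobotFileParser

# First attempt: MSNBot only has its own rule for /junk/.
BROKEN = """\
User-agent: MSNBot
Disallow: /junk/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

# Corrected file: the shared rules are repeated in the MSNBot section.
FIXED = """\
User-agent: MSNBot
Disallow: /junk/
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""


def allowed(robots_txt, agent, path):
    """Parse robots_txt and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://example.com" + path)


# In the broken file MSNBot never sees the rules under 'User-agent: *'.
broken_cgi = allowed(BROKEN, "MSNBot", "/cgi-bin/form.cgi")
fixed_cgi = allowed(FIXED, "MSNBot", "/cgi-bin/form.cgi")

print("broken file, MSNBot may fetch /cgi-bin/:", broken_cgi)
print("fixed file, MSNBot may fetch /cgi-bin/:", fixed_cgi)
```

With the first file this parser lets MSNBot into /cgi-bin/; with the corrected file it does not.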
Validation:
Here are a few sites where you can validate your robots.txt file:
- http://tool.motoricerca.info/robots-checker.phtml
- http://www.invision-graphics.com/robotstxt_validator.html
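If you prefer a quick local check before using an online validator, a few common slips can be caught with a short script. The following Python sketch is not a replacement for the validators above; the function name lint_robots, the list of known directives, and the specific checks are assumptions chosen for this example.

```python
# Directives mentioned in this article; anything else is flagged.
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text):
    """Return a list of (line_number, message) tuples for suspicious lines."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "missing ':' after the directive name"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        name = field.lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append((number, "unknown directive: " + field))
        elif name == "user-agent":
            seen_user_agent = True
        elif name in ("allow", "disallow"):
            if not seen_user_agent:
                problems.append((number, "rule appears before any User-agent line"))
            if " " in value:
                problems.append((number, "path contains a space (two lines fused?)"))
    return problems


# A sample with two mistakes: a fused line and a missing colon.
sample = "User-agent: Scooter\nDisallow: /User-agent: *\nDisallow /junk.html\n"
report = lint_robots(sample)
for number, message in report:
    print("line", number, "-", message)
```

The checks are deliberately simple; a clean result here does not guarantee that every crawler will read the file the way you expect.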
Happy coding