Erm's I.T. Girl – Zelna Ellis

Don't fear when Zel is near…

An Intro to robots.txt

Intro

The robots.txt protocol, also known as the Robots Exclusion Standard or Robots Exclusion Protocol, defines which parts of your site web spiders and crawlers (search indexing tools) may access.
The web administrator can allow or disallow access to certain parts of the site, e.g. temporary, private, or cgi directories. It is a plain text file that holds the rules for these spiders and crawlers: where they can go and where they cannot. You might also block a folder, e.g. /images, because you find crawling it both meaningless and a waste of your site’s bandwidth.

Syntax of robots.txt

# Comments start with the pound sign
User-agent: [crawl agent name or * for all]
There are currently three statements you can use in robots.txt:

  1. Disallow: /path
  2. Allow: /path
  3. Sitemap: http://example.com/sitemap.xml

Note:

  • Crawl-delay: [a number that represents the time in seconds] is not part of the standard, but some bots do obey it.
  • Each crawler user agent section must be separated by a blank line.
  • Not all search engines respect pattern matching, e.g. to block any directory whose name begins with picture: Disallow: /picture*/
  • It is recommended, if you use multiple User-agents, that the block addressed to all spiders (User-agent: *) is the last one. This is to reduce interpretation problems with some old robots.
  • Paths in directives are case-sensitive. E.g. junk.html and Junk.html are seen as two different files.
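If you want to see how a parser reads these statements, here is a minimal sketch using Python's standard urllib.robotparser module. The rules and the bot name AnyBot are made up for illustration, and this only shows how Python's parser behaves, not how any particular search engine does. site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
# Comments start with the pound sign
User-agent: *
Crawl-delay: 10
Disallow: /junk.html
Allow: /public/

Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Paths are case-sensitive: only the lower-case file is blocked.
print(parser.can_fetch("AnyBot", "/junk.html"))  # False
print(parser.can_fetch("AnyBot", "/Junk.html"))  # True

# Crawl-delay and Sitemap are exposed by this parser as well.
print(parser.crawl_delay("AnyBot"))
print(parser.site_maps())
```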

User-agents

User-Agents.Org has a very good list of these bots.
On the Web Robot’s page you can see a list of common robots and their User-agent names.

Blocking

Blocking the entire site:
Disallow: /

Blocking a specific directory, e.g. images, and its contents:
Disallow: /images/

Blocking a specific file:
Disallow: /junk.html

Blocking a specific file type:
Disallow: /*.xls$
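The * and $ in the last rule are pattern matching, which not every crawler supports. As a rough sketch of how a crawler that does support it might interpret such a path, the snippet below turns a rule into a regular expression. The function names are hypothetical, and it assumes * means "any run of characters" and a trailing $ means "end of the URL path"; real crawlers may differ.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex (sketch)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * stand for any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def rule_matches(pattern, path):
    """Return True if the URL path falls under the rule's pattern."""
    # re.match anchors at the start, so plain paths behave as prefixes.
    return robots_pattern_to_regex(pattern).match(path) is not None

print(rule_matches("/images/", "/images/logo.png"))    # True
print(rule_matches("/junk.html", "/junk.html"))        # True
print(rule_matches("/*.xls$", "/reports/q1.xls"))      # True
print(rule_matches("/*.xls$", "/reports/q1.xlsx"))     # False
print(rule_matches("/picture*/", "/pictures/cat.jpg")) # True
```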

Examples:

# Disallowing all robots

User-agent: *
Disallow: /

# Blocking only one robot, e.g. Altavista’s Scooter, and allowing all others

User-agent: Scooter
Disallow: /

User-agent: *
Disallow:
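You can check this last example locally with Python's urllib.robotparser. This is only a quick sketch: SomeOtherBot and /index.html are made-up names, and the result shows how Python's parser reads the file, which may not match every real crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Scooter
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Scooter is shut out of the whole site.
scooter_allowed = parser.can_fetch("Scooter", "/index.html")
# Any other bot falls through to the * section, where an empty
# Disallow means nothing is blocked.
other_allowed = parser.can_fetch("SomeOtherBot", "/index.html")

print(scooter_allowed)  # False
print(other_allowed)    # True
```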

Now this is something we learned by trial and error. The example below looks as if it blocks MSNBot from the junk directory, blocks all bots (including MSNBot) from some more directories, and gives every bot except MSNBot access to junk.
THIS DOES NOT WORK LIKE THAT! A bot that finds a section with its own name follows only that section and ignores the User-agent: * section.

User-agent: MSNBot
Disallow: /junk/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

You need to repeat the directories disallowed for all bots (User-agent: *) in the MSNBot section as well! It should be:

User-agent: MSNBot
Disallow: /junk/
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
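The difference between the two files above can be reproduced with Python's urllib.robotparser. This is a sketch under the assumption that MSNBot picks its section the way Python's parser does. The helper load(), the bot name OtherBot and the file paths are made up for the test.

```python
from urllib.robotparser import RobotFileParser

def load(text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

BROKEN = """\
User-agent: MSNBot
Disallow: /junk/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

FIXED = """\
User-agent: MSNBot
Disallow: /junk/
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

broken, fixed = load(BROKEN), load(FIXED)

# In the first file MSNBot only sees its own section, so /cgi-bin/ is open to it.
print(broken.can_fetch("MSNBot", "/cgi-bin/form.cgi"))  # True
# Once the rules are repeated, MSNBot is kept out as intended.
print(fixed.can_fetch("MSNBot", "/cgi-bin/form.cgi"))   # False
# Other bots may still visit /junk/ but not /private/.
print(fixed.can_fetch("OtherBot", "/junk/page.html"))   # True
print(fixed.can_fetch("OtherBot", "/private/a.html"))   # False
```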

Validation:

Here are a few sites where you can validate your robots.txt file:
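A quick local check is also possible before you turn to a validator. The sketch below is a very rough lint, not a full validator: lint_robots and KNOWN_FIELDS are hypothetical names, and the list of fields is just the ones mentioned in this post.

```python
# Fields mentioned in this post; anything else is flagged as unknown.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Return a list of (line number, problem) pairs for a robots.txt text."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, separator, _value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append((number, "missing ':' separator"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unknown field '{field}'"))
        elif field == "user-agent":
            seen_agent = True
        elif field in {"disallow", "allow"} and not seen_agent:
            problems.append((number, "rule appears before any User-agent line"))
    return problems

sample = "User-agent: *\nDisallow /junk.html\n"
for number, message in lint_robots(sample):
    print(f"line {number}: {message}")
```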

Happy coding 🙂

10 November 2009 | Web Site Development
