An Intro to robots.txt


The robots.txt protocol, also known as the Robots Exclusion Standard or Robots Exclusion Protocol, defines which parts of your site web spiders and crawlers (search indexing tools) may access.
The web administrator can allow or disallow access to certain parts of the site, e.g. temporary, private or cgi directories. The rules live in a plain text file that tells these spiders and crawlers where they can go and where they cannot. You can also block a folder such as /images if you find crawling it both meaningless and a waste of your site’s bandwidth.

Syntax of robots.txt

# Comments start with the pound sign
User-agent: [crawl agent name or * for all]
There are currently three statements you can use in robots.txt:

  1. Disallow: /path
  2. Allow: /path
  3. Sitemap:


  • Crawl-delay: [a number that represents the time in seconds] is not part of the standard, but some bots do obey it.
  • Each crawler user agent section must be separated by a blank line.
  • Not all search engines support pattern matching. E.g. to block any directory whose name begins with picture: Disallow: /picture*/
  • If you use multiple User-agent sections, it is recommended that the block addressed to all spiders (User-agent: *) is the last one. This reduces interpretation problems with some old robots.
  • Paths are case-sensitive. E.g. junk.html and Junk.html are seen as two different files.
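To see a few of these points in action, here is a minimal sketch using Python's standard urllib.robotparser module (site_maps() needs Python 3.8 or newer). The robots.txt content, the example.com sitemap URL and the bot name AnyBot are all made up for illustration, and this only shows how Python's parser reads the file; real crawlers may treat Crawl-delay differently or ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this sketch.
ROBOTS_TXT = """\
# Rules for every crawler
User-agent: *
Crawl-delay: 10
Disallow: /junk.html

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "AnyBot" is a placeholder name; it falls under the User-agent: * section.
blocked_lower = not parser.can_fetch("AnyBot", "/junk.html")
blocked_upper = not parser.can_fetch("AnyBot", "/Junk.html")
delay = parser.crawl_delay("AnyBot")
sitemaps = parser.site_maps()

print("junk.html blocked:", blocked_lower)
print("Junk.html blocked:", blocked_upper)
print("Crawl-delay:", delay)
print("Sitemaps:", sitemaps)
```

Because paths are compared case-sensitively, only the lowercase junk.html is blocked here.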


User-Agents.Org has a very good list of these bots.
The Web Robot’s page also lists common robots and their User-agent names.


Blocking the entire site:
Disallow: /

Blocking a specific directory, images, and its contents:
Disallow: /images/

Blocking a specific file:
Disallow: /junk.html

Blocking a specific file type:
Disallow: /*.xls$
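
For engines that do support pattern matching, the rules above could be evaluated roughly as in the following sketch. It assumes the commonly documented semantics (a plain path is a prefix, * matches any run of characters, a trailing $ anchors the end of the path). The function names and sample paths are made up for illustration, and this is not any particular crawler's implementation.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any run of characters and a
    trailing '$' anchors the pattern to the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' stood.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_blocked(pattern, path):
    """Return True if a Disallow pattern covers the given path."""
    # re.match anchors at the start, so plain paths act as prefixes.
    return robots_pattern_to_regex(pattern).match(path) is not None

# Hypothetical (pattern, path) pairs based on the rules above.
checks = [
    ("/", "/anything.html"),
    ("/images/", "/images/logo.png"),
    ("/junk.html", "/junk.html"),
    ("/*.xls$", "/reports/sales.xls"),
    ("/*.xls$", "/reports/sales.xlsx"),
]
for pattern, path in checks:
    print(pattern, path, is_blocked(pattern, path))
```

Only the last pair is not blocked: the $ means the path has to end in .xls, so sales.xlsx slips through.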


# Disallowing all robots

User-agent: *
Disallow: /

# Blocking only one robot, e.g. Altavista’s Scooter, and allowing all others

User-agent: Scooter
Disallow: /

User-agent: *
Disallow:

Now this is something we learned through trial and error. At first glance, the following example looks as if it blocks MSNBot from the junk directory, blocks all bots (including MSNBot) from some more directories, and gives every bot except MSNBot access to junk.

User-agent: MSNBot
Disallow: /junk/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

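It does not work that way. A bot that finds a section with its own name follows only that section and ignores the User-agent: * section. So the directories you disallow for all the bots (User-agent: *) must be repeated for MSNBot as well. It should be: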

User-agent: MSNBot
Disallow: /junk/
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
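
If you want to check this behaviour yourself, here is a minimal sketch with Python's standard urllib.robotparser module, which picks sections the same way: a bot with its own section ignores the User-agent: * section. The helper name parse_rules, the bot name OtherBot and the page paths are made up for illustration, and the real MSNBot may of course differ from Python's parser in other details.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Parse robots.txt content given as a string (helper for this sketch)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

FIRST_ATTEMPT = """\
User-agent: MSNBot
Disallow: /junk/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

CORRECTED = """\
User-agent: MSNBot
Disallow: /junk/
Disallow: /cgi-bin/
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

first = parse_rules(FIRST_ATTEMPT)
fixed = parse_rules(CORRECTED)

# MSNBot only reads its own section, so /private/ stays open in the first file.
print("first, MSNBot, /private/:", first.can_fetch("MSNBot", "/private/page.html"))
print("fixed, MSNBot, /private/:", fixed.can_fetch("MSNBot", "/private/page.html"))

# "OtherBot" is a placeholder for any bot without its own section.
print("fixed, OtherBot, /junk/:", fixed.can_fetch("OtherBot", "/junk/page.html"))
print("fixed, OtherBot, /private/:", fixed.can_fetch("OtherBot", "/private/page.html"))
```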


Here are a few sites where you can validate your robots.txt file:

Happy coding 🙂


Bing or Google it?

What is Bing? It is Microsoft’s new search engine. Formerly code-named "Kumo", the new search tool replaces Microsoft’s Live Search. I’ve never been a fan of Live Search, and if I used it five times, that’s a lot.

Bing is definitely a big improvement over Live Search. It has a big background picture, or splash page, that changes daily. At the top you have links for Web, Images, Videos, Shopping, News, Maps, More, MSN, Windows Live and so on. Under Extras, on the top right-hand side, you will find Preferences, Blogs and Advertising. Sponsored links are displayed on the right-hand side and related searches on the left-hand side; with Google you have to click the "Show Options" button to see that list. Search results are in the middle. The rest is the usual stuff: Boolean operators (OR, NOT, AND) within a search term, searching within a specific site or domain, and so on…

One feature that caught my eye is that Bing lets you take a quick peek before opening a link. If you hover with your mouse cursor over a search result you get a pop-up box where you can read a longer excerpt from the page without visiting the site.

When I search for something and find the answer, I’m content. I don’t understand "The War of the Search Engines" and don’t care who is numero uno. Will I switch? Probably not; Bing doesn’t give me a reason to switch.
Decide for yourself: Bing or Google

