An Intro to robots.txt
The robots.txt protocol, also known as the Robots Exclusion Standard or Robots Exclusion Protocol, defines how web spiders and crawlers (search indexing tools) may access your site.
The web administrator can allow or disallow access to certain parts of the site, e.g. temporary, private, or cgi directories. robots.txt is a plain text file that holds the rules for these spiders and crawlers: where they can go and where they cannot. You can also block a folder, e.g. /images, if you consider crawling it both pointless and a waste of your site's bandwidth.
Syntax of robots.txt
# Comments start with the pound sign
User-agent: [crawler name or * for all]
Three statements are commonly used in robots.txt, plus one non-standard extension:
- Disallow: /path
- Allow: /path
- Sitemap: http://example.com/sitemap.xml
- Crawl-delay: [a number that represents the time in seconds]. This is not part of the standard, but some bots do obey it.
- Each crawler user agent section must be separated by a blank line.
- Not all search engines support pattern matching. E.g. to block any directory whose name begins with picture: Disallow: /picture*/
- If you use multiple User-agent sections, it is recommended that the section addressed to all spiders (User-agent: *) comes last. This reduces interpretation problems with some old robots.
- Paths in directives are case-sensitive. E.g. junk.html and Junk.html are seen as two different files.
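To see the statements above in action, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules, the paths, the delay value and the crawler name AnyBot are all made up for illustration. One assumption worth knowing: Python's parser applies the first matching line within a section, so the Allow line is listed before the Disallow line it overrides; other crawlers may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the paths and the delay value are made up for illustration.
ROBOTS_TXT = """\
# Rules for every crawler
User-agent: *
Allow: /images/logo.png
Disallow: /images/
Disallow: /junk.html
Crawl-delay: 10

Sitemap: http://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "AnyBot"  # hypothetical crawler name
print(parser.can_fetch(agent, "/images/photo.jpg"))  # inside the disallowed directory
print(parser.can_fetch(agent, "/images/logo.png"))   # matched by the Allow line first
print(parser.can_fetch(agent, "/junk.html"))         # the disallowed file
print(parser.can_fetch(agent, "/Junk.html"))         # different case, so a different path
print(parser.crawl_delay(agent))                     # value of the Crawl-delay line
print(parser.site_maps())                            # Sitemap URLs (needs Python 3.8+)
```

Keep in mind that this only shows how one parser reads the file; each real crawler decides for itself how strictly it follows the rules.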
Blocking the entire site:
Blocking a specific directory, images, and its contents:
Blocking a specific file:
Blocking a specific file type:
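The four cases above could be written roughly as in the sketch below. The file name private.html and the .gif extension are hypothetical picks. The first three cases are plain prefix rules, so they can be checked with Python's urllib.robotparser. The file type case depends on the * and $ pattern extensions, which that parser does not interpret, so it is checked with a small hypothetical helper, wildcard_match, that only approximates how pattern-aware crawlers read such a line.

```python
import re
from urllib.robotparser import RobotFileParser

def blocked(rules: str, path: str, agent: str = "AnyBot") -> bool:
    """Return True if `agent` may not fetch `path` under `rules` (prefix matching)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch(agent, path)

# Blocking the entire site
ENTIRE_SITE = """\
User-agent: *
Disallow: /
"""

# Blocking a specific directory, images, and its contents
DIRECTORY = """\
User-agent: *
Disallow: /images/
"""

# Blocking a specific file (the file name is hypothetical)
SINGLE_FILE = """\
User-agent: *
Disallow: /private.html
"""

# Blocking a specific file type, written in robots.txt as "Disallow: /*.gif$".
# Only crawlers that support the * and $ extensions honour this pattern.
FILE_TYPE_PATTERN = "/*.gif$"

def wildcard_match(pattern: str, path: str) -> bool:
    """Hypothetical helper: '*' matches any run of characters,
    and a trailing '$' anchors the pattern to the end of the path."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.match(regex + ("$" if anchored else ""), path) is not None

print(blocked(ENTIRE_SITE, "/index.html"))
print(blocked(DIRECTORY, "/images/photo.jpg"))
print(blocked(SINGLE_FILE, "/private.html"))
print(wildcard_match(FILE_TYPE_PATTERN, "/photos/cat.gif"))
```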
Disallowing all robots:
Blocking only one robot, e.g. AltaVista's Scooter, and allowing all others:
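The single-robot case could look like the sketch below, again checked with Python's urllib.robotparser. The crawler name OtherBot is a made-up stand-in for "all others". Note that an empty Disallow line blocks nothing, which is how the second section lets everyone else in.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Block Scooter everywhere
User-agent: Scooter
Disallow: /

# Every other robot may crawl everything (an empty Disallow blocks nothing)
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Scooter", "/index.html"))   # Scooter is shut out
print(parser.can_fetch("OtherBot", "/index.html"))  # OtherBot is a hypothetical crawler
```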
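Now this is something we learned through trial and error. Suppose we want to block MSNBot from a certain directory, disallow some more directories for all bots (including MSNBot), and give every bot except MSNBot access to the directory junk. It is tempting to write one section for MSNBot containing only its extra rule, and one section for all bots (User-agent: *) containing the shared rules.
THIS DOES NOT WORK LIKE THAT!
A crawler that finds a section addressed to it by name obeys only that section and ignores the User-agent: * section. The directories you disallow for all the bots (User-agent: *) must therefore be repeated in the MSNBot section as well!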
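The difference can be sketched as follows, using Python's urllib.robotparser as a stand-in for a well-behaved crawler. The directories /cgi-bin/ and /tmp/ are hypothetical substitutes for the "some more directories", and OtherBot is a made-up crawler name. NAIVE is the layout that looks right but is not; FIXED repeats the shared rules in the MSNBot section.

```python
from urllib.robotparser import RobotFileParser

def can_fetch(rules: str, agent: str, path: str) -> bool:
    """Parse `rules` and report whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)

# /cgi-bin/ and /tmp/ are hypothetical stand-ins for the shared directories.
NAIVE = """\
User-agent: msnbot
Disallow: /junk/

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Allow: /junk/
"""

FIXED = """\
User-agent: msnbot
Disallow: /junk/
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Allow: /junk/
"""

# With NAIVE, msnbot only sees its own section, so /tmp/ is still open to it.
print(can_fetch(NAIVE, "msnbot", "/tmp/a.html"))
# With FIXED, the shared rules are repeated, so /tmp/ is closed to msnbot too.
print(can_fetch(FIXED, "msnbot", "/tmp/a.html"))
# Other bots keep their access to junk in both versions.
print(can_fetch(FIXED, "OtherBot", "/junk/x.html"))
```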
Here are a few sites where you can validate your robots.txt file:
Happy coding 🙂