How to Optimize a Robots.txt File for Your Site or Blog
A robots.txt file is a plain text file that sits in the root directory of your web site. It should not be stored in a sub-directory, because web crawlers only look for robots.txt in the root directory. Why is a robots.txt file so important?
It is important because it gives you some control over how web spiders such as Googlebot, Yahoo Slurp, and others crawl your web site. I have found that most web sites on the Internet have a very basic robots.txt that pretty much allows every crawler to crawl the whole site. Many times this is desirable. However, if you use WordPress, for example, there may be portions of your site you do not want crawled. This matters for search engines such as Google and Yahoo, which may penalize a site for duplicate content. So it is very important that your robots.txt file is set up correctly to avoid these issues and ensure your site is crawled optimally.
So without further delay, let's go over the basics of the robots.txt file.
A robots.txt file contains the following:
1. A line or lines stating a User-agent.
2. A line or lines using the Disallow directive with various parameters.
Example robots.txt to allow all crawlers access.
User-agent: *
Disallow:
Example robots.txt to disallow all crawlers from your site.
User-agent: *
Disallow: /
Example robots.txt to disallow one specific crawler, but allow all others.
User-agent: *
Disallow:
User-agent: MSNbot
Disallow: /
The example above would forbid MSNbot from crawling your site, yet allow all other robots to crawl it.
Example robots.txt to disallow a particular directory.
User-agent: *
Disallow: /cgi-bin
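One way to sanity-check rules like the ones above before uploading them is Python's standard urllib.robotparser module. The sketch below is only an illustration: the example.com URLs and the allowed() helper are made up for this post, and Python's parser is not guaranteed to behave exactly like any particular search engine's crawler.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Block one crawler, allow everyone else (same rules as the MSNbot example).
block_msnbot = """\
User-agent: *
Disallow:
User-agent: MSNbot
Disallow: /
"""

# Block a single directory for every crawler.
block_cgi_bin = """\
User-agent: *
Disallow: /cgi-bin
"""

page = "http://example.com/index.html"  # hypothetical URL
script = "http://example.com/cgi-bin/form.pl"  # hypothetical URL

print("MSNbot may fetch page:", allowed(block_msnbot, "MSNbot", page))
print("Googlebot may fetch page:", allowed(block_msnbot, "Googlebot", page))
print("Googlebot may fetch script:", allowed(block_cgi_bin, "Googlebot", script))
```

Disallow rules are matched as path prefixes, so /cgi-bin covers everything underneath that directory as well as any other path that happens to start with the same characters.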
Web crawlers read the robots.txt file for instructions on how to crawl your site. I have seen many people using the “Allow” directive. While it is true that some web crawlers support the Allow directive, I do not recommend using it, and it was not part of the original Robots Exclusion Standard.
The reason I recommend against using the Allow directive is that whatever isn’t explicitly disallowed is crawled by default. So there is no real reason to use the Allow directive, and it may cause problems with crawlers that don’t support it.
The * character is called a “wildcard”. It can be used in a file path on a Disallow directive, or in the User-agent line to represent all web robots. Below I will give you an example of the wildcard in use.
Example robots.txt using wildcard character
User-agent: *
Disallow: /*/private
The above would disallow all robots from crawling a directory named private inside any folder. In the path /*/private, the star stands for any folder name.
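To see which paths a wildcard rule like this would cover, here is a minimal sketch in Python. It assumes the interpretation that the major search engines document, where * stands for any run of characters and the rule is matched from the start of the path. The function name and the sample paths are hypothetical, and a crawler that does not support wildcards may treat the star as a literal character instead.

```python
import re


def wildcard_rule_matches(rule: str, path: str) -> bool:
    """Return True if `path` falls under a Disallow `rule` that may contain *."""
    # Escape the rule so only * keeps a special meaning, then let each *
    # stand for any run of characters (including none).
    pattern = re.escape(rule).replace(r"\*", ".*")
    # re.match anchors at the start only, which gives prefix matching.
    return re.match(pattern, path) is not None


rule = "/*/private"
for path in ["/blog/private/notes.html", "/private", "/blog/public"]:
    print(path, "blocked" if wildcard_rule_matches(rule, path) else "allowed")
```

Note that under this interpretation a top-level /private directory is not covered by /*/private, because the rule requires some folder level in front of it. If you want to block that too, add a separate Disallow: /private line.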
Earlier in this post, I mentioned naming a particular User-agent such as MSNbot. In a robots.txt file, you can create different rules for different robots. For example, you could allow Googlebot access to every folder except private, and allow Yahoo Slurp access to the whole site including private. This gives you a lot of flexibility in how your site is indexed. You can also include the location of your sitemap.xml file in your robots.txt file to tell the crawlers where your site map is located.
Below, I will give you an example of multiple User-agents.
Example of multiple User-agents and sitemap in a robots.txt file.
User-agent: MSNbot
Disallow: /wp-login.php
Disallow: /wp-login.php?*
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
User-agent: Teoma
Disallow: /wp-login.php
Disallow: /wp-login.php?*
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /blog/wp-content/cache
Disallow: /blog/wp-content/themes
User-agent: *
Disallow:
Sitemap: http://example.com/sitemap.xml
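When several crawlers share the same long list of rules, typing each group by hand makes it easy for the lists to drift apart. The sketch below shows one possible way to generate the file from a single list in Python. The build_robots_txt() function, the WORDPRESS_PATHS name, and the shortened path list are all made up for illustration, so adjust them to your own site.

```python
from typing import Dict, List, Optional


def build_robots_txt(groups: Dict[str, List[str]], sitemap: Optional[str] = None) -> str:
    """Render robots.txt text from a mapping of user-agent -> disallowed paths."""
    lines: List[str] = []
    for agent, paths in groups.items():
        lines.append(f"User-agent: {agent}")
        # An empty list becomes a bare "Disallow:", which allows everything.
        for path in paths or [""]:
            lines.append(f"Disallow: {path}".rstrip())
        lines.append("")  # blank line as a customary separator between groups
    if sitemap:
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines).rstrip("\n") + "\n"


# Hypothetical, shortened list of paths to keep crawlers out of.
WORDPRESS_PATHS = ["/wp-login.php", "/cgi-bin", "/wp-admin", "/wp-includes"]

robots_txt = build_robots_txt(
    {"MSNbot": WORDPRESS_PATHS, "Teoma": WORDPRESS_PATHS, "*": []},
    sitemap="http://example.com/sitemap.xml",
)
print(robots_txt)
```

Save the printed text as robots.txt in the root directory of your site. Keeping the paths in one list means a change only has to be made once for every crawler that shares it.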
As you can see above, being able to control what robots access on your server can be beneficial to your web site's SEO. This is especially true if you have a WordPress blog or another content management system and would rather not have certain archive pages, category pages, and tag pages indexed. The robots.txt file lets you declare that, so the crawlers won’t crawl those portions of your web site. Also, referencing your sitemap.xml file in your robots.txt helps crawlers find the pages on your site so they can be spidered and indexed.
Now for the last part, below I will give you a list of the most common User-Agents.
List of common User-Agents
- MSNbot – Microsoft Bing/Live Search Crawler
- Googlebot – Google crawler
- Slurp – Yahoo Crawler
- Teoma – Ask Search Engine Crawler
- ia_archiver – Wayback Machine Internet Archive Crawler
- Googlebot-Image – Google Image Search Crawler
- Googlebot-Mobile – Google Mobile Web Crawler
- MSNbot-Media – Microsoft Bing/Live Search Image/Video Crawler
- AdsBot-Google – Google Ads Landing Page Quality Crawler
- Mediapartners-Google – Google AdSense Crawler
Now that you are armed with the information above, you should be well on your way to mastering the robots.txt file and optimizing your web site. However, I want to stress a few words of caution. Do not rely on a robots.txt file to hide sensitive information on your web server from robots. There are many “bad” robots out there that do not honor your robots.txt file. These robots will scan anything and everything on your web site regardless of what your robots.txt file says. If you have information you do not want out there, consider not putting it up at all, or password protect that directory on your web server. I point this out so no one misunderstands what the robots.txt file is intended to do.
Good luck! I hope this article helps anyone who has questions about how to set up a robots.txt file.




