This is a continuation of the 20 Tips to SEO series. We’ve already covered Google Penalty, Image Optimization and Link Building Tips. Today, let’s talk about optimizing the Robots.txt file for websites and blogs.
1. Include your sitemap file URL in Robots.txt
2. If you have a lot of images on the site, add an image sitemap, and if you have videos on the site, consider adding a video sitemap along with the regular one.
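As a rough sketch of tips 1 and 2, the sitemap references might look like the lines below. The domain and the sitemap file names are placeholders made up for illustration, so swap in whatever your own sitemaps are called. Each Sitemap line takes a full URL.

```text
# Placeholder domain and file names; replace with your own
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/image-sitemap.xml
Sitemap: https://www.example.com/video-sitemap.xml
```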
3. Make sure you exclude all the script folders, admin folders and other backend stuff from crawling.
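A minimal sketch of tip 3 could look like this. The folder names are only guesses at what a typical site might have, so check your own directory layout before copying any of them.

```text
User-agent: *
# Hypothetical backend folders; replace with the ones on your site
Disallow: /cgi-bin/
Disallow: /scripts/
Disallow: /admin/
```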
4. Go to Google Webmaster Tools and check that your Robots.txt is set right.
You can do it here: Site Configuration > Crawler Access
5. To generate a Robots.txt file, do not depend on third-party SEO tools other than Google Webmaster Tools. There is an efficient Robots.txt generator at Site Configuration > Crawler Access > Generate Robots.txt
6. The general syntax to be written in the Robots file is this.
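A minimal version of that syntax, with /yourfolder/ standing in for whichever folder you want to keep crawlers out of:

```text
User-agent: *
Disallow: /yourfolder/
```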
Here, user-agent: * means all search agents (Google, MSN, Yahoo, etc.).
/yourfolder/ blocks that folder from being crawled. Note that its sub-folders will not be crawled either.
7. Specifying Image location for Google Image Robots.
If you have a lot of images, pointing the Google Image robot at one particular folder to crawl is a good idea. On a WordPress site, for example, wp-content/uploads/ is the directory where the images are.
8. Typically you can set the user agent to *, making the rules applicable to all bots/user agents. But there are several different user agents, and Google alone has several bots of its own. To know the entire list of user agents, check this out.
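A sketch of what such a rule might look like, assuming a WordPress-style layout and that you want Google's image bot to stay inside the uploads folder. Whether to keep the Disallow line is a judgment call for your site. Since Googlebot-Image gets its own group here, it follows these lines rather than the ones under user-agent: *.

```text
User-agent: Googlebot-Image
# Let the image bot crawl the uploads folder
Allow: /wp-content/uploads/
# Assumption: nothing else on the site should be crawled for images
Disallow: /
```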
9. Exclude unwanted URLs using Robots.txt
If there are URLs you don’t want the search engines to crawl, use the following syntax.
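A minimal example, using /directory/folder as a stand-in path:

```text
User-agent: *
Disallow: /directory/folder
```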
In the above example, all the URLs beginning with /directory/folder won’t be crawled.
10. If you find that a file excluded via Robots.txt is still indexed on Google (probably via backlinks from other pages or sites), you can use the robots meta tag with noindex to get it excluded. Keep in mind that the crawler has to be able to fetch the page to see that tag, so the page must not be blocked in Robots.txt at the same time.
11. Robots.txt is not the surest way to exclude or include a file or folder on search engines. There could be mistakes. I suggest you also use the meta index/noindex tags and check the URL's crawlability in Google Webmaster Tools to cross-check.
12. Make sure you follow the syntax as described by the standards here.
13. The Robots.txt file is not the surest way to block or remove a URL from being indexed on Google. Use the URL Removal Tool inside Google Webmaster Tools to get this done.
14. To comment within the Robots.txt file use the hash symbol.
# Comments go here.
15. Do not put all the folders and file names on one line. The right syntax is one folder or file per line, each with its own Disallow.
16. URL paths and folder names are case sensitive. A rule with a typo or the wrong case is as good as not being there.
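To illustrate tip 15 with two made-up folder names: the first form below is the one to avoid, the second is the one crawlers expect.

```text
# Wrong: two paths crammed into a single rule
# Disallow: /cgi-bin/ /tmp/

# Right: one path per Disallow line
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
```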
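A small illustration of tip 16, using a hypothetical /Private/ folder. The rule matches only the exact case written in it.

```text
User-agent: *
# Blocks /Private/report.html
# Does not block /private/report.html (different case, so a different path)
Disallow: /Private/
```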
17. Using the “Allow” command.
Some crawlers, like Google's, support an additional command called “Allow”. It lets you explicitly state which files/folders should be crawled. This field was not part of the original "robots.txt" protocol. It has since been included in the standardized version (RFC 9309), but older or simpler crawlers may still ignore it.
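A sketch of how Allow is commonly combined with Disallow, with made-up paths: the folder is blocked as a whole while one page inside it stays crawlable. Google picks the most specific matching rule, but other crawlers may treat the combination differently, so test it before relying on it.

```text
User-agent: Googlebot
# Block the folder as a whole
Disallow: /folder/
# but keep this one page crawlable
Allow: /folder/public-page.html
```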
18. Robots.txt for blogger and wordpress.com users.
Blogger users cannot upload a robots.txt file. Instead, they can use the robots meta tag to control how bots treat particular pages.
19. Even if your site lives in a sub-directory, make sure that your Robots.txt is always in the root directory of the domain. This is a standard.
20. Make sure your robots.txt file has the right access permissions and is not writable by all.