Robots.txt - Telling the search engines what they can and cannot index
When your site is indexed by the search engines, it is "crawled" by the search engine spiders - GoogleBot, Yahoo Slurp, Bingbot - in order to find all the content on your site, so that other people can find it.
But what if you've got sections of your website that you don't want indexed? The bots dumbly index whatever they can find. They don't know that, for example, those photos on the hidden part of your site are strictly friends and family only, or that there are certain pages (like your long-expired special offers) that you'd really rather not have popping up in the search engine listings or being archived by that pesky internet archive bot. In this lesson we look at robots.txt - telling the search engines what they can and cannot index.
What is the robots.txt file?
Robots.txt is a small text document that lives in the root of your website and tells the "robots" visiting your website which pages they can and cannot access. When one of these "robots" visits your site, the first thing it does is look for the robots.txt file. Well-behaved robots follow your requests, and won't visit pages that you've disallowed.
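To make that lookup concrete, here is a minimal sketch in Python of what a well-behaved robot does before fetching a page. It uses the standard library's urllib.robotparser module. The rules, the robot name "ExampleBot" and the helper function may_visit are made up for illustration; a real crawler would download the file from your site rather than read it from a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as they might appear in www.yoursite.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /personal/
"""


def may_visit(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's contents as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.yoursite.com/index.html",
        "http://www.yoursite.com/personal/diary.html",
    ):
        verdict = "allowed" if may_visit(ROBOTS_TXT, "ExampleBot", url) else "blocked"
        print(url, "->", verdict)
```

With these rules the home page comes back as allowed and anything under /personal/ as blocked. Nothing forces a robot to run a check like this, which is why only polite robots can be relied on to respect the file.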
How do you make a robots.txt file?
Decide which areas of your website you want the spiders to index, and which ones you don't want them crawling through. And decide if there are any bots you would rather not have crawling through your site.
Open up your plaintext editor of choice, create a new, blank text file and save it as robots.txt, then write this information into the file:
To block all spiders from your entire website:
User-agent: *
Disallow: /
To let all spiders see all content on your site:
User-agent: *
Disallow:
To block certain directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
Disallow: /photos/staffchristmasparty/
To block a certain spider:
User-agent: Googlebot
Disallow: /
To allow a certain spider, while blocking others:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
- You must use a new line for each instruction.
- Blank lines are used to show separate groups of instructions (as in the last example).
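- The asterisk in the User-agent line means "all robots". In the original robots.txt convention it isn't a general-purpose wildcard, so if you wanted to disallow all GIF images on your website, you can't rely on Disallow: *.gif working for every bot. Some major crawlers, including Googlebot and Bingbot, do understand pattern rules such as Disallow: /*.gif$, but many other bots don't.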
- Your file must be called robots.txt, all in lower-case.
- Your file must be located in the root directory of your website: www.yoursite.com/robots.txt. That's where the spiders look when they visit your site, and they won't find it if you put it anywhere else.
Now simply save your file and upload it to your website.
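If you'd like to sanity-check your rules before uploading, the following Python sketch may help. It builds the text of a robots.txt file with one instruction per line and a blank line between groups, then uses the standard library's urllib.robotparser to confirm the rules behave as intended. The helper names build_robots_txt and is_allowed, and the robot name "SomeOtherBot", are invented for this example.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(groups):
    """Build robots.txt text from (user_agent, [disallowed paths]) pairs.

    An empty list of paths means "nothing is blocked" for that robot.
    """
    blocks = []
    for user_agent, disallowed in groups:
        lines = [f"User-agent: {user_agent}"]
        # One instruction per line; an empty Disallow value allows everything
        lines += [f"Disallow: {path}".rstrip() for path in (disallowed or [""])]
        blocks.append("\n".join(lines))
    # A blank line separates each group of instructions
    return "\n\n".join(blocks) + "\n"


def is_allowed(robots_txt, user_agent, path):
    """Return True if the given rules let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Allow Googlebot everywhere, block every other robot from the whole site
    text = build_robots_txt([("Googlebot", []), ("*", ["/"])])
    print(text)
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(text, agent, "/index.html"))
```

This only tells you how Python's parser reads the file. Individual search engines may interpret edge cases differently, so treat it as a quick check rather than a guarantee.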
Robots.txt and your XML sitemap
If you've seen our lesson on creating XML sitemaps, you'll know that your robots.txt file is a really handy place to let the search engines know where your sitemap is.
All you have to do is leave a blank line after the last command in your robots.txt file, and then paste in a Sitemap line giving the full URL of your sitemap, for example:
Sitemap: http://www.example.com/sitemap.xml
If you've got more than one sitemap, you can enter more than one line:
Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml
Sitemap: http://www.example.com/sitemap3.xml
This way you don't need to specifically tell each and every search engine where they can find your sitemap. They'll see it as soon as they look for your robots.txt file, which every polite bot will do when they visit your site anyway.
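As a rough illustration of how a bot can pick up those Sitemap lines, here is a small Python sketch using urllib.robotparser from the standard library. It assumes Python 3.8 or later, where the site_maps() method is available. The sample rules and the helper name find_sitemaps are made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents with two Sitemap lines after the rules
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml
"""


def find_sitemaps(robots_txt):
    """Return the sitemap URLs listed in robots_txt (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in find_sitemaps(ROBOTS_TXT):
        print(url)
```

The Sitemap lines are read independently of the User-agent groups, so they apply to every robot that reads the file.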
Things you need to know
Not all spiders honor robots.txt
"Polite" spiders, such as those belonging to the major search engines, follow the rules and won't index items you've listed in your robots.txt file. However, not all robots are polite (for example, those from some smaller search engines, or general data scraping bots), and these will collect any and all content anyway.
Your robots.txt is publicly accessible!
Don't try to use your robots.txt file to hide content on your site - the robots.txt file can be viewed by anybody, simply by typing www.yoursite.com/robots.txt into their browser, so anybody can see the things you've said you don't want indexed!
If there's content on your website that you really, really don't want anybody else seeing, your best bet is to password-protect that directory. There will usually be a tool to help you do this in your hosting control panel (cPanel or similar). Note that password-protecting your content (if done right) will also prevent the "impolite" bots from accessing it.
In this lesson we've looked at robots.txt - what it is, what it's used for, and how to create one. We've looked at certain things you can do with robots.txt including:
- Blocking your entire site from indexing
- Blocking certain directories
- Blocking certain bots
- Identifying the location of the sitemap