Internet Marketing - Matt Bailey
and the .txt extension. It should look like Figure 14-15.

Figure 14-15: The robots.txt file, set to allow the entire website to be spidered by the search engines

Upload this file using an FTP program to the root level of your website. Congratulations! You now have put out the welcome mat for the search engines. If you are curious to know more, read on. If not, jump ahead to the review and get started evaluating and working on your site!

Learning robots.txt Structure

Only two lines are required for a standard robots.txt file. The first line identifies the robots you want to specifically command.

User-agent: *

The asterisk is a wildcard, meaning “all robots—follow these instructions.”

The second line tells the robots where not to go, which is defined either at the directory level or at the page level.

Disallow:

If you don’t want to disallow anything, then don’t type another character! That’s the typical setup to give the search engines free rein of your website. It’s as simple as that.
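Putting the two required lines together, the fully permissive file described above is simply:

```
User-agent: *
Disallow:
```

Any crawler that honors the protocol will read this as "index everything."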

Now, some people get a little fancy and like to disallow certain directories. This is usually done to remove any duplicate content. So, let’s say I have a directory of all my printer-friendly pages, which are really only duplicates of the HTML pages.

User-agent: *

Disallow: /printerfriendly/

I’ve disallowed the entire directory of printer-friendly pages by specifically naming it to the search engines.

The forward slash is an important part of this file. Where most people make their mistakes is with that slash. The forward slash indicates a directory, and anything contained in the directory after the first forward slash will be disallowed. A forward slash by itself indicates that you want to block the entire root directory and anything contained within the structure. Ouch! That would be a big mistake!
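The forward-slash matching is easy to sanity-check before uploading. As a sketch, Python's standard-library `urllib.robotparser` applies the same matching rules compliant crawlers use (the `example.com` URLs are placeholders; the `/printerfriendly/` directory is the chapter's illustration):

```python
from urllib.robotparser import RobotFileParser

# The rules from the printer-friendly example above
rules = [
    "User-agent: *",
    "Disallow: /printerfriendly/",
]

parser = RobotFileParser()
parser.parse(rules)

# Anything under the disallowed directory is blocked...
print(parser.can_fetch("*", "https://www.example.com/printerfriendly/page.html"))
# ...while the rest of the site stays open to crawlers.
print(parser.can_fetch("*", "https://www.example.com/page.html"))
```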

Accidentally Blocking Your Website

By adding a slash to the disallow command, like this:

Disallow: /

you are telling the search engines to “go away.” This would be disastrous for most websites, and it happens often.
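The same standard-library checker shows just how total the lone slash is: with `Disallow: /`, every URL on the site comes back blocked (again, `example.com` is a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# The disastrous one-character mistake: a bare slash disallows everything
parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# Every page, including the home page, is now off-limits to compliant bots
print(parser.can_fetch("Googlebot", "https://www.example.com/"))
print(parser.can_fetch("Googlebot", "https://www.example.com/about.html"))
```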


This happens often because of development projects, where robots.txt is used to block search engines from indexing the work in progress. When the site goes live, the development team simply copies everything from the staging server or development directory onto the new server. The robots.txt file gets copied over with the new website and is forgotten. Only when the rankings fall and the new site does not appear in the search engines does someone realize there is a problem. To remedy this, consider using an .htaccess password rather than robots.txt to block access to the staging site. A password would be nearly impossible to deploy “accidentally” on a new site, because it would be noticed immediately.
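One way to implement the password suggestion is HTTP Basic Authentication on the staging server. A minimal sketch, assuming an Apache server and a hypothetical credentials file at /var/www/.htpasswd (both the path and the realm name are placeholders):

```apache
# .htaccess on the STAGING server only -- never copy to production
AuthType Basic
AuthName "Staging - authorized users only"
AuthUserFile /var/www/.htpasswd
Require valid-user
```

If this file were accidentally copied to the live site, every visitor would hit a password prompt, so the mistake would be noticed immediately, unlike a stray robots.txt.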

Google’s Webmaster Tools has a function (under Site Configuration ⇒ Crawler Access) that will show you the status of your robots.txt file and all of the documents and directories that are being blocked. This tool shows you the last time that the file was requested and the pages that are disallowed from Google’s index. At the bottom of the page, you can test your robots.txt file to be sure that Googlebot and Googlebot-Mobile (Google’s search engine spiders) are able to successfully access your website. If you have set up your account using Google’s Webmaster Tools, the name of the website and the text of the robots.txt should be shown in this resource. Click the test button at the bottom of the page, and Google will show you if it is able to access your site using the current robots.txt protocol (see Figure 14-16).

Figure 14-16: A successful robots.txt test in Google’s Webmaster Tools

As with most of these resources, if it works in Google’s Webmaster Tools, it will work with other search engines. If Google is not able to access your site because of robots.txt, then Yahoo!, Bing, Ask, and others will not be able to access your site either.

Additionally, robots.txt is only a protocol: not all search engines, and not all bots, actually follow it.
