Webbots, Spiders, and Screen Scrapers - Michael Schrenk [114]
However futile the attempt, you should still use the robots.txt file if for no other reason than to mark your turf. If you are serious about securing your site from webbots and spiders, however, you should use the tactics described later in this chapter.
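For reference, a minimal robots.txt file might look like the following sketch (the path shown is hypothetical; the file must sit at the root of the website):

```
# Ask all robots to stay out of a hypothetical /private/ directory
User-agent: *
Disallow: /private/
```

Remember that nothing forces a webbot to read this file, much less obey it.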
Use the Robots Meta Tag
Like the robots.txt file, the intent of the robots meta tag[76] is to warn spiders to stay clear of your website. Unfortunately, this tactic suffers from many of the same limitations as the robots.txt file, because it also lacks an enforcement mechanism. A typical robots meta tag is shown in Listing 27-1.
<meta name="robots" content="noindex, nofollow">

Listing 27-1: The robots meta tag
There are two main commands for this meta tag: noindex and nofollow. The first command tells spiders not to index the web page in search results. The second command tells spiders not to follow links from this web page to other pages. Conversely, index and follow commands are also available, and they achieve the opposite effect. These commands may be used together or independently.
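For instance, to keep a page out of search indexes while still allowing spiders to follow its links, the two commands can be mixed (a sketch, assuming the standard meta tag syntax described above):

```html
<!-- Do not index this page, but do follow its links to other pages -->
<meta name="robots" content="noindex, follow">
```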
The problem with site usage policies, robots.txt files, and meta tags is that the webbots visiting your site must voluntarily honor your requests. On a good day, this might happen. On its own, a Terms of Service policy, a robots.txt file, or a robots meta tag is something short of a social contract, because a contract requires at least two willing parties. There is no enforcing agency to contact when someone doesn't honor your requests. If you want to deter webbots and spiders, you should start by asking nicely and then move on to the tougher approaches described next.
* * *
[73] The filename robots.txt is case sensitive. It must always be lowercase.
[74] Each website should have only one robots.txt file.
[75] The robots.txt specification is available at http://www.robotstxt.org.

[76] The specification for the robots meta tag is available at http://www.robotstxt.org/wc/meta_user.html.
Building Speed Bumps
More effective methods of deterring webbots are ones that make it difficult for a webbot to operate on a website. Remember, however, that a determined webbot designer may still overcome these obstacles.
Selectively Allow Access to Specific Web Agents
Some developers may be tempted to detect their visitors' web agent names and only serve pages to specific browsers like Internet Explorer or Firefox. This is largely ineffective because a webbot can pose as any web agent it chooses.[77] However, if you insist on implementing this strategy, make sure you use a server-side method of detecting the agent, since you can't trust a webbot to interpret JavaScript.
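To see why agent-name filtering is weak, consider this sketch in Python (not the book's own code; the URL is a placeholder): a webbot can claim to be any browser simply by setting its own User-Agent header before making a request.

```python
import urllib.request

# A webbot can claim any agent name it likes; the server has no way
# to verify the header. The URL below is a hypothetical placeholder.
request = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0"},
)

# The request now carries the spoofed agent name.
print(request.get_header("User-agent"))
```

Because the header is entirely under the client's control, any server-side allowlist of "real browser" agent names is trivially defeated.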
Use Obfuscation
As you learned in Chapter 20, obfuscation is the practice of hiding something through confusion. For example, you could use HTML special characters to obfuscate an email link, as shown in Listing 27-2.
Please email me at: &#109;&#101;&#64;&#97;&#100;&#100;&#114;&#46;&#99;&#111;&#109;
Listing 27-2: Obfuscating the email address me@addr.com with HTML special characters
While the special characters are hard for a person to read, a browser has no problem rendering them, as you can see in Figure 27-2.
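As an illustration (a Python sketch, not code from the book), each character of an address can be replaced with its decimal HTML entity, and a webbot can reverse the process just as easily with a standard library call:

```python
import html

def obfuscate(text):
    # Replace every character with its decimal HTML entity,
    # e.g. 'm' becomes '&#109;'
    return "".join(f"&#{ord(c)};" for c in text)

encoded = obfuscate("me@addr.com")
print(encoded)                 # &#109;&#101;&#64;&#97;&#100;&#100;&#114;&#46;&#99;&#111;&#109;
print(html.unescape(encoded))  # a webbot recovers me@addr.com trivially
```

The one-line decode is exactly why this technique slows down casual harvesters at best.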
You shouldn't rely on obfuscation to protect data because