Webbots, Spiders, and Screen Scrapers - Michael Schrenk [115]
Figure 27-2. A browser rendering of the obfuscated script in Listing 27-2
Use Cookies, Encryption, JavaScript, and Redirection
Lesser webbots and spiders have trouble handling cookies, encryption, and page redirection, so deterrents built on these techniques can be effective in some cases. PHP/CURL resolves most of these issues for webbot developers, but webbots still stumble over cookies and page redirections written in JavaScript, since most webbots lack JavaScript interpreters. Extensive use of JavaScript is therefore an effective deterrent, especially when JavaScript creates the links to other pages or generates the HTML content itself.
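To see why JavaScript-generated content stymies a typical webbot, consider a sketch like the following. The markup is hypothetical; it contains one ordinary link and one link written by `document.write()`. A simple parser-based webbot, which reads only the raw HTML, never sees the second link because it is just text inside a script block until a JavaScript interpreter executes it.

```python
from html.parser import HTMLParser

# Hypothetical page: one ordinary link, one link written by JavaScript.
PAGE = """
<html><body>
<a href="/public.html">Public page</a>
<script>
  document.write('<a href="/members.html">Members area</a>');
</script>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects href attributes the way a simple webbot would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)  # only /public.html; the JavaScript-generated link never appears
```

A browser would render both links, but the parser reports only `/public.html`, which is exactly why JavaScript-generated navigation hides pages from most webbots.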
Authenticate Users
Where possible, place all confidential information in password-protected areas. This is your best defense against webbots and spiders. Authentication, however, only stops people who lack login credentials; it does not prevent authorized users from building webbots and spiders that harvest information and use services within the password-protected areas of a website. You can learn about writing webbots that access password-protected websites in Chapter 21.
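The gatekeeping itself is simple. The following minimal sketch (the user table, page text, and function names are all hypothetical, and real code would store hashed passwords) shows the essential pattern: confidential content is served only when the request carries a valid session token issued at login. Note that, as the text says, this does nothing to stop a webbot operated by someone who holds valid credentials.

```python
import secrets

# Hypothetical in-memory stores; a real site would use a database
# and hashed passwords.
SESSIONS = {}                  # session token -> username
USERS = {"alice": "s3cret"}

def login(username, password):
    """Return a new session token on success, None otherwise."""
    if USERS.get(username) != password:
        return None
    token = secrets.token_hex(16)
    SESSIONS[token] = username
    return token

def confidential_page(token):
    """Serve protected content only to authenticated sessions."""
    user = SESSIONS.get(token)
    if user is None:
        return "403 Forbidden"
    return f"Welcome back, {user}."
```

A webbot without credentials can fetch the page URL all day and receive nothing but the refusal; only a request carrying a token from a successful login gets the content.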
Update Your Site Often
Possibly the single most effective way to confuse a webbot is to change your site on a regular basis. A website that changes frequently is more difficult for a webbot to parse than a static site. The challenge is to change the things that foul up webbot behavior without making your site hard for people to use. For example, you may choose to randomly take one of the following actions:
Change the order of form elements
Change form methods
Rename files in your website
Alter text that may serve as convenient parsing reference points, like form variables
These techniques may be easy to implement if you're using a high-quality content management system (CMS). Without a CMS, though, it will take a more deliberate effort.
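As a rough illustration of the first two items, a page template could re-render its forms with randomized layout on each request. This sketch (field names and the `/login` URL are hypothetical) shuffles the order of the form elements and varies the form method, both of which break webbots that parse by position or assume a fixed method:

```python
import random

# Hypothetical form fields; a parser that expects them in a fixed
# order, or expects a fixed form method, will break.
FIELDS = [
    '<input type="text" name="user">',
    '<input type="password" name="pass">',
    '<input type="submit" value="Log in">',
]

def render_form(rng=random):
    fields = FIELDS[:]
    rng.shuffle(fields)                   # change the order of form elements
    method = rng.choice(["get", "post"])  # change form methods
    body = "\n  ".join(fields)
    return f'<form action="/login" method="{method}">\n  {body}\n</form>'
```

In practice you would shuffle within limits that keep the form usable for people, and the server-side handler must of course accept whichever method the rendered form used.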
Embed Text in Other Media
Webbots and spiders rely on text represented by HTML codes, which are nothing more than numbers capable of being matched, compared, or manipulated with mathematical precision. If you place important text inside images or other non-textual media like Flash, movies, or Java applets, however, that text is hidden from automated agents. This differs from the obfuscation method discussed earlier because it relies on a human's ability to interpret what he or she sees. For example, it is now common for authentication forms to display text embedded in an image and ask the user to type that text into a field before allowing access to a secure page. While it's possible for a webbot to process text within an image, it is quite difficult, especially when the text is varied and set on a busy background, as shown in Figure 27-3. This technique is called a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA).[79] You can find more information about CAPTCHA devices at this book's website.
Before embedding all your website's text in images, however, you need to recognize the downside. When you put text in images, beneficial spiders, like those used by search engines, will not be able to index your web pages. Placing text within images is also a very inefficient way to render text.
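The server-side bookkeeping behind a CAPTCHA is straightforward; the hard part for the webbot is reading the image. This sketch shows only the challenge/response logic (the names are hypothetical, and the step of rendering the text into a distorted image with an imaging library is omitted):

```python
import secrets
import string

# Hypothetical store mapping a challenge ID to its expected answer.
PENDING = {}

def new_challenge(length=6):
    """Create a random challenge; the text goes into an image, never the HTML."""
    text = "".join(secrets.choice(string.ascii_uppercase) for _ in range(length))
    challenge_id = secrets.token_hex(8)
    PENDING[challenge_id] = text
    return challenge_id, text

def verify(challenge_id, answer):
    """Check an answer; each challenge is good for exactly one attempt."""
    expected = PENDING.pop(challenge_id, None)
    return expected is not None and answer.strip().upper() == expected
```

Only the opaque challenge ID appears in the form, so a webbot that cannot read the image has nothing to parse, and allowing a single attempt per challenge blocks brute-force guessing against one image.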
Figure 27-3. Text within an image is hard for a webbot to interpret
* * *
[77] Read Chapter 3 if you are interested in browser spoofing.
[78] To learn the difference between obfuscation and encryption, read Chapter 20.
[79] Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is a registered trademark of Carnegie Mellon University.