Online Book Reader


Webbots, Spiders, and Screen Scrapers - Michael Schrenk [116]

Carnegie Mellon University.

Setting Traps

Your strongest defenses against webbots are techniques that detect webbot behavior. Webbots behave differently from people because they are machines and lack human reasoning ability. A webbot will therefore do things that a person won't do, and it lacks information that a person either knows or can figure out by examining his or her environment.

Create a Spider Trap

A spider trap is a technique that capitalizes on the behavior of a spider, forcing it to identify itself without interfering with normal human use. The spider trap in the following example exploits a spider's tendency to follow every hyperlink on a web page indiscriminately. If some links are either invisible or unavailable to people using browsers, you'll know that any agent that follows one of those links is a spider. For example, consider the hyperlinks in Listing 27-3.

Listing 27-3: Two spider traps

There are many ways to trap a spider. Some other techniques include image maps with hot spots that don't exist and hyperlinks located in invisible frames without width or height attributes.
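The trap logic described above can be sketched in Python. This is a minimal illustration rather than the book's own listing: the trap URL `/spider_trap.php`, the sample page, and the helper names `handle_request` and `simulate_spider` are all assumptions made for the example. The page contains one link with no clickable content and one link hidden with CSS; a simulated spider that follows every `href` ends up requesting the trap URL, which is how the server learns it is dealing with a spider.

```python
from html.parser import HTMLParser

# Hypothetical trap URL; any path that no visible link points to would do.
TRAP_PATH = "/spider_trap.php"

# Two illustrative traps: a link with no clickable content, and a link
# hidden from view with CSS. Both are assumptions for this sketch.
PAGE = f"""<html><body>
<a href="/products.html">Products</a>
<a href="{TRAP_PATH}"></a>
<a href="{TRAP_PATH}" style="display:none">Do not follow</a>
</body></html>"""


class LinkCollector(HTMLParser):
    """Collects every href, the way an indiscriminate spider would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# IP addresses of clients that requested the trap URL.
trapped_ips = set()


def handle_request(ip, path):
    """Record the client as a spider if it requests the trap URL."""
    if path == TRAP_PATH:
        trapped_ips.add(ip)
        return "spider"
    return "unknown"


def simulate_spider(ip, html):
    """Follow every link in the page and report how each request is classified."""
    spider = LinkCollector()
    spider.feed(html)
    return [handle_request(ip, link) for link in spider.links]
```

A person using a browser never sees or clicks either trap link, so only the first link is requested and the visitor's address never enters `trapped_ips`.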

Fun Things to Do with Unwanted Spiders

Once unwanted guests are detected, you can treat them to a variety of services.

Identifying a spider is the first step in dealing with it. Moreover, because spiders can use browser-spoofing techniques, a spider trap becomes a necessity for determining which traffic is automated and which is human. What you do once you detect a spider is up to you, but Table 27-1 should give you some ideas. Just remember to act within commonsense legal guidelines and your own website policies.

Table 27-1. Strategies for Responding When You Identify a Spider

Strategy: Implementation

Banish: Record the IP addresses of spiders that reach the spider trap and configure the webserver to ignore future requests from these addresses.

Limit access: Record the IP addresses of the spiders in the spider trap and limit the pages they can access on their next visit.

Mislead: Depending on the situation, you could redirect known (unwanted) spiders to an alternate set of misleading web pages. As much as I love this tactic, you should consult with an attorney before implementing this idea.

Analyze: Analyze the IP address and find out where the spider comes from, who might own it, and what it is up to. A good resource for identifying IP addresses registered in the United States is http://www.arin.net. You could even create a special log that tracks all activity from known hostile spiders. You can also use this technique to learn whether or not a spider is part of a distributed attack.

Ignore: The default option is to just ignore any automated activity on your website.
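Three of the strategies in Table 27-1 lend themselves to a short Python sketch. Everything here is an assumption for illustration: the sample addresses (taken from documentation-reserved ranges), the `LIMITED_PAGES` set, the use of a 403 status to stand in for "ignore future requests", and the idea of flagging a possible distributed attack by counting trapped addresses that share a /24 network. The Mislead strategy is deliberately left out.

```python
import ipaddress
from collections import Counter

# Hypothetical data: IP addresses previously recorded by a spider trap.
TRAPPED_IPS = {"203.0.113.7", "203.0.113.9", "203.0.113.40", "198.51.100.2"}

# Hypothetical pages a trapped spider may still read under "limit access".
LIMITED_PAGES = {"/", "/index.html"}


def respond(ip, path, strategy="ignore"):
    """Return an HTTP-style status code for a request under the chosen strategy."""
    if ip not in TRAPPED_IPS or strategy == "ignore":
        return 200  # untrapped visitors, or the default "ignore" option
    if strategy == "banish":
        return 403  # refuse every request from a trapped address
    if strategy == "limit":
        return 200 if path in LIMITED_PAGES else 403
    raise ValueError(f"unknown strategy: {strategy}")


def suspicious_networks(ips, prefix=24, threshold=2):
    """Group IPv4 addresses by network and keep networks with several trapped IPs."""
    counts = Counter(
        str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False)) for ip in ips
    )
    return {net: n for net, n in counts.items() if n >= threshold}
```

Calling `suspicious_networks(TRAPPED_IPS)` groups the three `203.0.113.x` addresses under one network, which is one rough signal that several spiders may be operated together. A real analysis would also consult registry data such as ARIN, as the table suggests.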

Final Thoughts

Before website owners decide to expend their resources on deterring webbots, they should ask themselves a few questions.

What can a webbot do with your website that a person armed with a browser cannot do?

Are your deterrents keeping desirable spiders (like search engines) from accessing your web pages?

Does an automated agent (that you want to thwart) pose an actual threat to your website? Is it possible that it may even provide a benefit, as a procurement bot might?

If your website contains information that needs to be protected from webbots, should that information really be online in the first place?

If you put information in a public place, do you really have the right to bar certain methods of reading it?

If you still insist on banning webbots from your website, keep in mind that unless you deliberately develop measures like the ones near the end of this chapter, you will probably have little luck in defending your site from rogue webbots.

Chapter 28. KEEPING WEBBOTS OUT OF TROUBLE

By this point, you know how to access, download, parse, and process any of the 76 million websites on the Internet.[80] Knowing how to do something, however, does not give you the right to do it. While I have cast warnings throughout
