Webbots, Spiders, and Screen Scrapers - Michael Schrenk [98]

system administrator will attempt to limit your activity when he or she realizes a webbot is accessing the website. You should strive to limit the number of times your webbot accesses any site. There are no definite rules about how often you can access a website, but remember that if an individual system administrator decides your IP is hitting a site too often, his or her opinion will always trump yours.[67] If you ever exceed your bandwidth budget, you may find yourself blocked from the site.
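
One way to keep a webbot within its bandwidth budget is to enforce a minimum pause between requests to the same host. The following is a minimal Python sketch of that idea, not code from LIB_http. The class name PoliteThrottle and the 10-second default are assumptions made for illustration; choose a delay that suits the target site.

```python
import time
from urllib.parse import urlparse


class PoliteThrottle:
    """Enforce a minimum delay between successive requests to one host.

    Hypothetical helper for illustration; the default delay is an assumption.
    """

    def __init__(self, min_delay_seconds=10.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Pause until it is polite to fetch `url`; return the seconds slept."""
        host = urlparse(url).netloc.lower()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_delay_seconds - elapsed)
        if delay > 0:
            self._sleep(delay)
        # Record the moment the request is allowed to proceed.
        self._last_request[host] = self._clock()
        return delay
```

Call wait() immediately before each download. Because the delay is tracked per host, a webbot that visits several sites is slowed only on the site it has just accessed.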

Error Logs

Like access logs, error logs record activity on a website, but they record only the requests that cause errors. A sampling of an actual error log is shown in Listing 24-2.

[Tue Mar 08 14:57:12 2008] [warn] module mod_php4.c is already added, skipping
[Tue Mar 08 15:48:10 2008] [error] [client 127.0.0.1] File does not exist: c:/program files/apache group/apache/htdocs/favicon.ico
[Tue Mar 08 15:48:13 2008] [error] [client 127.0.0.1] File does not exist: c:/program files/apache group/apache/htdocs/favicon.ico
[Tue Mar 08 15:48:37 2008] [error] [client 127.0.0.1] File does not exist: c:/program files/apache group/apache/htdocs/t.gif

Listing 24-2: Typical error log entries

The errors your webbot is most likely to make involve requests for unsupported methods (often HEAD requests) or requesting files that aren't on the website. If your webbot repeatedly commits either of these errors, a system administrator will easily determine that a webbot is making the erroneous page requests, because it is almost impossible to cause these errors when manually surfing with a browser. Since error logs tend to be smaller than access logs, entries in error logs are very obvious to system administrators.
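
To see how little effort it takes an administrator to spot a misbehaving client, consider this Python sketch, which tallies error entries by client address. It assumes each entry sits on a single line in the format shown in Listing 24-2; other server versions and configurations write different layouts. The function name errors_per_client is hypothetical.

```python
import re
from collections import Counter

# Matches entries shaped like those in Listing 24-2:
#   [timestamp] [level] [client address] message
# The [client ...] field is optional because server-level warnings omit it.
ENTRY = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \[(?P<level>\w+)\] "
    r"(?:\[client (?P<client>[^\]]+)\] )?(?P<message>.*)$"
)


def errors_per_client(lines):
    """Count [error] entries for each client address found in `lines`."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line.strip())
        if match and match.group("level") == "error" and match.group("client"):
            counts[match.group("client")] += 1
    return counts
```

An address with a noticeably high count is an obvious candidate for closer inspection, which is exactly the attention a webbot should avoid.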

However, not all entries in an error log indicate that something unusual is going on. For example, it's common for people to use expired bookmarks or to follow broken links, both of which could generate File not found errors.

At other times, errors are logged in access logs, not error logs. These errors include using a GET method to send a form instead of a POST (or vice versa), or emulating a form and sending the data to a page that is not a valid action address. These are perhaps the worst errors because they are impossible for someone using a browser to commit. As a result, they will make your webbot stand out like a sore thumb in the log files.

These are the best ways to avoid strange errors in log files:

Debug your webbot's parsing software on web pages that are on your own server before releasing it into the wilderness

Use a form analyzer, as described in Chapter 5, when emulating forms

Program your webbot to stop if it is looking for something specific but cannot find it
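
The last two suggestions can be combined in a short sketch: read the form's method and action from the page itself, and stop if the expected form is missing. This Python example uses the standard html.parser module and is not the form analyzer described in Chapter 5. The names FormFinder, LandmarkNotFound, and find_form are hypothetical.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collect the method and action of every <form> tag on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attributes = dict(attrs)
            self.forms.append({
                # HTML treats a form with no method attribute as GET.
                "method": (attributes.get("method") or "get").upper(),
                "action": attributes.get("action") or "",
            })


class LandmarkNotFound(Exception):
    """Raised when the page no longer looks the way the webbot expects."""


def find_form(html, expected_action):
    """Return the form posting to `expected_action`, or stop the webbot."""
    finder = FormFinder()
    finder.feed(html)
    for form in finder.forms:
        if form["action"] == expected_action:
            return form
    # Failing loudly is better than submitting data to a guessed address.
    raise LandmarkNotFound(f"no form with action {expected_action!r}; stopping")
```

If find_form raises LandmarkNotFound, the page has probably changed, and the webbot should quit rather than send requests that a browser could never produce.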

Custom Logs

Many web administrators also keep detailed custom logs, which contain additional data not found in either error or access logs. Information that may appear in custom logs includes the following:

The name of the web agent used to download a file

The fully resolved domain name that corresponds to the requesting IP address

A coherent list of pages a visitor viewed during any one session

The referer, that is, the page that led the visitor to the requested page

The first item on the list is very important and easy to address. If you call your webbot "test webbot", which is the default setting in LIB_http, the web administrator will finger your webbot as soon as he or she views the log file. Sometimes this is by design; for example, if you want your webbot to be discovered, you may use an agent name like "See www.myWebbot.com for more details". I have seen many webbots brand themselves similarly.
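
The agent name travels in the User-Agent header of every request, so setting it takes a single line. The sketch below uses Python's standard urllib.request as a stand-in for LIB_http, whose own interface is not shown here. The function name build_request and the URL in the usage note are assumptions.

```python
import urllib.request


def build_request(url, agent_name):
    """Prepare (but do not send) a request that announces `agent_name`."""
    return urllib.request.Request(url, headers={"User-Agent": agent_name})
```

For example, build_request("http://www.example.com/", "See www.myWebbot.com for more details") creates a request that identifies itself openly, and passing the result to urllib.request.urlopen() would send it.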

If the administrator does a reverse DNS lookup to convert IP addresses to domain names, that makes it very easy to trace the origin of traffic. You should always assume this is happening and restrict the number of times you access a target.
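
A reverse DNS lookup is a single call in most languages, which is why you should assume administrators perform it. This Python sketch wraps the standard socket.gethostbyaddr function. The wrapper name reverse_dns and its resolver parameter are hypothetical, added so the lookup can be replaced during testing.

```python
import socket


def reverse_dns(ip_address, resolver=socket.gethostbyaddr):
    """Return the host name registered for `ip_address`, or None if none resolves."""
    try:
        hostname, _aliases, _addresses = resolver(ip_address)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None
    return hostname
```

Running this against your own webbot's IP address shows what an administrator would learn from the same lookup.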

Some metrics programs also create reports that show which pages specific visitors downloaded on sequential visits. If your webbot always downloads the same pages in the same order, you're bound to look odd. For this reason, it's best to
