Online Book Reader


Webbots, Spiders, and Screen Scrapers - Michael Schrenk [110]

to the pages within Flash movies. In short, websites done entirely in Flash kill any and all attempts at SEO and will receive less traffic than properly formatted HTML websites.

Designing Data-Only Interfaces

Often, the express purpose of a web page is to deliver data to a webbot, another website, or a stand-alone desktop application. These web pages aren't concerned about how people will read them in a browser. Rather, they are optimized for efficiency and ease of use by other computer programs. For example, you might need to design a web page that provides real-time sales information from an e-commerce site.

XML

Today, the eXtensible Markup Language (XML) is considered the de facto standard for transferring online data. XML describes data by wrapping it in HTML-like tags. For example, consider the sample sales data from an e-commerce site, shown in Table 26-1.

When converted to XML, the data in Table 26-1 looks like Listing 26-7.

Table 26-1. Sample Sales Information

Brand        Style      Color   Size   Price

Gordon LLC   Cotton T   Red     XXL    19.95

Ava St       Girlie T   Blue    S      19.95

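<ORDER>
    <SHIRT>
        <BRAND>Gordon LLC</BRAND>
        <COLOR>Red</COLOR>
        <SIZE>XXL</SIZE>
        <PRICE>19.95</PRICE>
    </SHIRT>
    <SHIRT>
        <BRAND>Ava St</BRAND>
        <COLOR>Blue</COLOR>
        <SIZE>S</SIZE>
        <PRICE>19.95</PRICE>
    </SHIRT>
</ORDER>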

Listing 26-7: An XML version of the data in Table 26-1
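As a rough illustration of the serving side, the sketch below builds an XML document of this kind from an array of sales records. It is only a sketch under stated assumptions: the function name build_order_xml(), the array layout, and the tag names ORDER, SHIRT, BRAND, COLOR, SIZE, and PRICE are all hypothetical choices for this example, and a real page would read its records from the store's database and send an XML content type before the output.

```php
<?php
// Hypothetical sales records; in practice these would come from a database.
$shirts = array(
    array("brand" => "Gordon LLC", "color" => "Red",  "size" => "XXL", "price" => "19.95"),
    array("brand" => "Ava St",     "color" => "Blue", "size" => "S",   "price" => "19.95"),
);

// Build an XML document from the records.
// Tag names are assumptions: each array key is uppercased and used as a tag.
function build_order_xml(array $shirts)
{
    $xml = "<ORDER>\n";
    foreach ($shirts as $shirt) {
        $xml .= "  <SHIRT>\n";
        foreach ($shirt as $field => $value) {
            $tag = strtoupper($field);
            // Escape &, <, >, and quotes so a value cannot break the markup
            $escaped = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES);
            $xml .= "    <$tag>$escaped</$tag>\n";
        }
        $xml .= "  </SHIRT>\n";
    }
    return $xml . "</ORDER>\n";
}

$xml = build_order_xml($shirts);
echo $xml;
```

Escaping each value matters because a brand name containing an ampersand or an angle bracket would otherwise produce a document that clients cannot parse.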

XML presents data in a format that is not only easy to parse but that, in some applications, may also tell the client computer what to do with the data. The actual tags used to describe the data are not terribly important, as long as the XML server and client agree on their meaning. The script in Listing 26-8 downloads and parses the XML shown in Listing 26-7.

# Include libraries
include("LIB_http.php");
include("LIB_parse.php");

# Download the order
$url = "http://www.schrenk.com/nostarch/webbots/26_1.php";
$download = http_get($url, "");

# Parse the orders (tag names are assumed; they must match the tags the server sends)
$order_array = return_between($download['FILE'], "<ORDER>", "</ORDER>", $type=EXCL);

# Parse shirts from order array
$shirts = parse_array($order_array, $open_tag="<SHIRT>", $close_tag="</SHIRT>");
for($xx=0; $xx<count($shirts); $xx++)
{
    $brand[$xx] = return_between($shirts[$xx], "<BRAND>", "</BRAND>", $type=EXCL);
    $color[$xx] = return_between($shirts[$xx], "<COLOR>", "</COLOR>", $type=EXCL);
    $size[$xx] = return_between($shirts[$xx], "<SIZE>", "</SIZE>", $type=EXCL);
    $price[$xx] = return_between($shirts[$xx], "<PRICE>", "</PRICE>", $type=EXCL);
}

# Echo data to validate the download and parse
for($xx=0; $xx<count($shirts); $xx++)
{
    echo "BRAND=".$brand[$xx]."<br>".
         "COLOR=".$color[$xx]."<br>".
         "SIZE=".$size[$xx]."<br>".
         "PRICE=".$price[$xx]."<br><br>";
}

Listing 26-8: A script that parses XML data
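For comparison, the same kind of document can also be read with PHP's built-in SimpleXML extension instead of string delimiters. This is not the book's method, only an alternative sketch: it assumes the SimpleXML extension is enabled, uses an inline sample in place of the downloaded file, and assumes the tag names ORDER, SHIRT, BRAND, COLOR, SIZE, and PRICE.

```php
<?php
// Inline sample standing in for the downloaded file; tag names are assumed.
$xml_text = <<<XML
<ORDER>
  <SHIRT><BRAND>Gordon LLC</BRAND><COLOR>Red</COLOR><SIZE>XXL</SIZE><PRICE>19.95</PRICE></SHIRT>
  <SHIRT><BRAND>Ava St</BRAND><COLOR>Blue</COLOR><SIZE>S</SIZE><PRICE>19.95</PRICE></SHIRT>
</ORDER>
XML;

// simplexml_load_string() returns false when the text is not well-formed XML
$order = simplexml_load_string($xml_text);
if ($order === false) {
    exit("Could not parse the order\n");
}

// Copy each SHIRT element into a plain array
$rows = array();
foreach ($order->SHIRT as $shirt) {
    $rows[] = array(
        "brand" => (string) $shirt->BRAND,
        "color" => (string) $shirt->COLOR,
        "size"  => (string) $shirt->SIZE,
        "price" => (float) $shirt->PRICE,
    );
}

// Echo data to validate the parse
foreach ($rows as $row) {
    echo "BRAND={$row['brand']} COLOR={$row['color']} SIZE={$row['size']} PRICE={$row['price']}\n";
}
```

A real XML parser rejects malformed input outright, whereas delimiter-based parsing keeps working on partial or sloppy markup; which behavior is preferable depends on how much you trust the server.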

Lightweight Data Exchange

As useful as XML is, it suffers from overhead because it delivers much more protocol than data. While this isn't important with small amounts of XML, the problem of overhead grows along with the size of the XML file. For example, it may take a 30KB XML file to present 10KB of data. Excess overhead needlessly consumes bandwidth and CPU cycles, and it can become expensive on extremely popular websites. In order to reduce overhead, you may consider designing lightweight interfaces. Lightweight interfaces deliver data more efficiently by presenting data in variables or arrays that can be used directly by the webbot. Granted, this is only possible when you define both the web page delivering the data and the client interpreting the data.
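To get a feel for the overhead, the following sketch compares the size of a single record wrapped in XML with the size of the values alone. The record and its tag names are assumptions made for illustration, and the ratio for real files will vary with tag length and nesting.

```php
<?php
// One record as XML (tag names assumed for this example)
$xml = "<SHIRT><BRAND>Gordon LLC</BRAND><COLOR>Red</COLOR>"
     . "<SIZE>XXL</SIZE><PRICE>19.95</PRICE></SHIRT>";

// strip_tags() removes the markup, leaving only the character data
$data = strip_tags($xml);

$total_bytes    = strlen($xml);
$data_bytes     = strlen($data);
$overhead_bytes = $total_bytes - $data_bytes;

// Report how much of the transfer is markup rather than data
printf("total=%d data=%d overhead=%d (%.0f%% of the transfer)\n",
       $total_bytes, $data_bytes, $overhead_bytes,
       100 * $overhead_bytes / $total_bytes);
```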

How Not to Design a Lightweight Interface

Before we explore proper methods for passing data to webbots, let's explore what can happen if your design doesn't take the proper security measures. For example, consider the order data from Table 26-1, reformatted as variable/value pairs, as shown in Listing 26-9.

$brand[0]="Gordon LLC";
$style[0]="Cotton T";
$color[0]="red";
$size[0]="XXL";
$price[0]=19.95;
$brand[1]="Ava St";
$style[1]="Girlie T";
$color[1]="blue";
