Online Book Reader


Webbots, Spiders, and Screen Scrapers - Michael Schrenk [53]

where you insert your parsing tags, you can figure out where your parsing script thinks the desired data is.

Think of the text you want to parse as blocks of text surrounded by other blocks of text you don't need. Imagine that the web page you want to parse looks like Figure 11-5, where the desired text is depicted as the dark blocks. Find the beginning of the first block you want to parse. Strip away everything prior to this point and insert an opening tag at the beginning of this block (Figure 11-6). Replace the text that separates the blocks you want to parse with a closing tag followed by an opening tag. Now every block of text you want to parse is sandwiched between opening and closing tags (see Figure 11-7), so the text can be easily parsed with the parse_array() function. The final opening tag is an artifact and is ignored.

Figure 11-5. Desired data depicted in dark gray

Figure 11-6. Initiating an insertion parse

Figure 11-7. Delimiting desired text with tags

The script that performs the insertion parse is straightforward, but it depends on accurately identifying the text that surrounds the blocks we want to parse. The first step is to locate the text that identifies the start of the first block. The only way to do this is to look at the HTML source code of the search results. A quick examination reveals that the first organic is immediately preceded by .[37] The next step is to find some common text that separates each organic search result. In this case, the search results are also separated by .

To place the tag at the beginning of the first block, the webbot uses the strpos() function to determine where the first block of text begins. That position is then used in conjunction with substr() to strip away everything before the first block. Then a simple string concatenation places a tag in front of the first block, as shown in Listing 11-6.

// We need to place the first tag before the first piece
// of desired data, which we know starts with the first occurrence
// of $separator
$separator = "";

// Find first occurrence of $separator
$beg_position = strpos($page, $separator);

// Get rid of everything before the first piece of desired data
// and insert a tag before the data
$page = substr($page, $beg_position, strlen($page));
$page = "".$page;

Listing 11-6: Inserting the initial insertion parse tag (as in Figure 11-6)
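One case worth guarding against is a page in which the separator never appears: strpos() then returns false, which substr() would treat as position 0, leaving the page untrimmed. The following is a minimal sketch of a guarded version. The function name insert_initial_tag(), the <data> tag, and the <!--gap--> separator are illustrative assumptions, not names taken from the script above.

```php
<?php
// Hypothetical helper: strip everything before the first separator and
// prepend an opening tag. Returns null when the separator is absent.
function insert_initial_tag(string $page, string $separator, string $open_tag): ?string
{
    $beg_position = strpos($page, $separator);

    // strpos() returns false (not 0) when nothing is found, so compare strictly
    if ($beg_position === false) {
        return null;
    }

    // Keep everything from the first separator onward, then add the tag
    return $open_tag . substr($page, $beg_position);
}

// Usage with assumed sample values
$page   = "header junk<!--gap-->first result<!--gap-->second result";
$tagged = insert_initial_tag($page, "<!--gap-->", "<data>");
echo $tagged, "\n";
```

Returning null lets the caller decide whether a missing separator means the page layout changed or the download failed, instead of parsing an unexpected page.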

The insertion parse is completed with the insertion of the closing and opening tags. The webbot does this by simply replacing the block separator that we identified earlier with our insertion tags, as shown in Listing 11-7.

$page = str_replace($separator, " ", $page);

// Put all the desired content into an array
$desired_content_array = parse_array($page, "", "", EXCL);

Listing 11-7: Inserting the insertion delimiter tags (as in Figure 11-7)

Once the insertion is complete, each block of text is sandwiched between tags that allow the webbot to use the parse_array() function to create an array in which each array element is one of the blocks. Could you perform this parse without the insertion parse technique? Of course. However, the insertion parse is more flexible and easier to debug, because you have more control over where the delimiters are placed, and you can see where the file will be parsed before the parse occurs.
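The whole technique can be sketched end to end on a toy string, which also shows the debugging benefit: the delimited page can be printed before any parsing happens. Everything specific here is an assumption made for illustration: the <data> and </data> tags, the <!--gap--> separator, the sample page, and the blocks_between() helper, which stands in for parse_array(). This sketch also skips past the first separator so that the first array element is not empty.

```php
<?php
// Stand-in for parse_array(): returns the text found between each
// $open/$close pair. This helper is an assumption for illustration.
function blocks_between(string $text, string $open, string $close): array
{
    $pattern = '/' . preg_quote($open, '/') . '(.*?)' . preg_quote($close, '/') . '/s';
    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

// Assumed sample page and separator (not real search-result markup)
$separator = "<!--gap-->";
$page = "nav bar<!--gap-->result A<!--gap-->result B<!--gap-->footer";

// Step 1: drop everything up to and including the first separator,
// then open the first block
$start = strpos($page, $separator);
if ($start === false) {
    throw new RuntimeException("separator not found");
}
$page = "<data>" . substr($page, $start + strlen($separator));

// Step 2: every remaining separator closes one block and opens the next
$page = str_replace($separator, "</data><data>", $page);

// The delimiters can be inspected before any parsing happens
echo $page, "\n";
// <data>result A</data><data>result B</data><data>footer

// Step 3: collect the delimited blocks; the trailing "<data>footer"
// has no closing tag, so it is ignored
$blocks = blocks_between($page, "<data>", "</data>");
print_r($blocks);
```

If a block comes out wrong, the printed intermediate string shows at a glance whether the separator was misidentified or the tags landed in the wrong place.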

Once the search results are parsed and placed into an array, it's a simple process to compare them with the web page we're ranking, as in Listing 11-8.

for($page_rank=0; $page_rank<count($desired_content_array); $page_rank++)
{
    // Look for the $subject_site to appear in one of the listings
    if(stristr($desired_content_array[$page_rank], $subject_site))
    {
        $url_found_rank_on_page = $page_rank;
        $url_found=true;
    }
}

Listing 11-8: Determining if an organic matches the subject web page
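A variant of this comparison that stops at the first match might look like the sketch below. The find_rank() function and the sample listings are hypothetical and only illustrate the idea; stripos() is used here in place of stristr() because only the presence of a match matters, not the matched text.

```php
<?php
// Hypothetical helper: return the zero-based index of the first listing
// that mentions $subject_site, or null if none does.
function find_rank(array $listings, string $subject_site): ?int
{
    foreach ($listings as $rank => $listing) {
        // stripos() is a case-insensitive substring search
        if (stripos($listing, $subject_site) !== false) {
            return $rank; // stop at the first (highest-ranked) match
        }
    }
    return null;
}

// Assumed sample data
$listings = [
    '<a href="http://www.example.org/">Example Org</a>',
    '<a href="http://www.Example.com/page">Example Com</a>',
    '<a href="http://www.example.com/other">Another</a>',
];
$rank = find_rank($listings, "www.example.com");
echo $rank === null ? "not found" : "found at index $rank", "\n";
```

The index is zero-based, so a caller that reports a human-readable position would add one to the returned value.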

If the web page we're looking for is found, the webbot records its ranking and sets a flag to tell the webbot to stop looking for additional occurrences of
