Webbots, Spiders, and Screen Scrapers - Michael Schrenk [22]
Up to this point, parsing has meant extracting desired text from a larger string. Sometimes, however, parsing means manipulating text. For example, since webbots usually lack JavaScript interpreters, it's often best to delete JavaScript from downloaded files. In other cases, your webbots may need to remove all images or email addresses from a web page. For these reasons, LIB_parse includes the remove() function, an easy-to-use interface for removing unwanted text from a web page. Its usage is shown in Listing 4-9.
/*
string remove( string web_page
             , string open_tag
             , string close_tag
             )
Where
    web_page  is the contents of the web page you want to affect
    open_tag  defines the beginning of the text that you want to remove
    close_tag defines the end of the text that you want to remove
*/
Listing 4-9: Using remove()
By adjusting the input parameters, the remove() function can remove a variety of text from web pages, as shown in Listing 4-10.
$uncommented_page = remove($web_page, "<!--", "-->");
$links_removed = remove($web_page, "<a", "</a>");
$images_removed = remove($web_page, "<img", ">");
$javascript_removed = remove($web_page, "<script", "</script>");
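To make the open_tag/close_tag idea concrete, here is a minimal sketch of how a function with the same interface might work. It is not the LIB_parse source: the name remove_between() is hypothetical, and the case-insensitive matching and the decision to leave an unmatched opening delimiter untouched are assumptions made for this illustration.

```php
<?php
// Hypothetical stand-in for a remove()-style helper; NOT the LIB_parse code.
// Deletes every span that starts at $open_tag and ends at the next $close_tag.
function remove_between(string $web_page, string $open_tag, string $close_tag): string
{
    // Empty delimiters cannot mark a span, so return the page unchanged.
    if ($open_tag === '' || $close_tag === '') {
        return $web_page;
    }

    $result = '';
    $offset = 0;

    // Assumption: matching is case-insensitive, so "<IMG" and "<img" both match.
    while (($start = stripos($web_page, $open_tag, $offset)) !== false) {
        $end = stripos($web_page, $close_tag, $start + strlen($open_tag));
        if ($end === false) {
            // Unmatched opening delimiter: keep the rest of the page as-is.
            break;
        }
        // Keep the text before the span, then skip past the closing delimiter.
        $result .= substr($web_page, $offset, $start - $offset);
        $offset = $end + strlen($close_tag);
    }

    // Append whatever follows the last removed span.
    return $result . substr($web_page, $offset);
}

// Usage mirroring the calls in Listing 4-10, on a small made-up page.
$web_page = '<p>Hi<!-- note --> <img src="a.png"> there<script>x()</script></p>';
$uncommented_page   = remove_between($web_page, '<!--', '-->');
$images_removed     = remove_between($web_page, '<img', '>');
$javascript_removed = remove_between($web_page, '<script', '</script>');
```

Because the delimiters are plain substrings rather than parsed tags, a short opening delimiter such as "<a" would also match tags like "<abbr". Choosing delimiters that are as specific as the page allows reduces accidental matches.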