Webbots, Spiders, and Screen Scrapers - Michael Schrenk
Table of Contents
ACKNOWLEDGMENTS
Introduction
Old-School Client-Server Technology
The Problem with Browsers
What to Expect from This Book
Learn from My Mistakes
Master Webbot Techniques
Leverage Existing Scripts
About the Website
About the Code
Requirements
Hardware
Software
Internet Access
A Disclaimer (This Is Important)
I. FUNDAMENTAL CONCEPTS AND TECHNIQUES
1. WHAT'S IN IT FOR YOU?
Uncovering the Internet's True Potential
What's in It for Developers?
Webbot Developers Are in Demand
Webbots Are Fun to Write
Webbots Facilitate "Constructive Hacking"
What's in It for Business Leaders?
Customize the Internet for Your Business
Capitalize on the Public's Inexperience with Webbots
Accomplish a Lot with a Small Investment
Final Thoughts
2. IDEAS FOR WEBBOT PROJECTS
Inspiration from Browser Limitations
Webbots That Aggregate and Filter Information for Relevance
Webbots That Interpret What They Find Online
Webbots That Act on Your Behalf
A Few Crazy Ideas to Get You Started
Help Out a Busy Executive
Save Money by Automating Tasks
Protect Intellectual Property
Monitor Opportunities
Verify Access Rights on a Website
Create an Online Clipping Service
Plot Unauthorized Wi-Fi Networks
Track Web Technologies
Allow Incompatible Systems to Communicate
Final Thoughts
3. DOWNLOADING WEB PAGES
Think About Files, Not Web Pages
Downloading Files with PHP's Built-in Functions
Downloading Files with fopen() and fgets()
Downloading Files with file()
Introducing PHP/CURL
Multiple Transfer Protocols
Form Submission
Basic Authentication
Cookies
Redirection
Agent Name Spoofing
Referer Management
Socket Management
Installing PHP/CURL
LIB_http
Familiarizing Yourself with the Default Values
Using LIB_http
Learning More About HTTP Headers
Examining LIB_http's Source Code
Final Thoughts
4. PARSING TECHNIQUES
Parsing Poorly Written HTML
Standard Parse Routines
Using LIB_parse
Splitting a String at a Delimiter: split_string()
Parsing Text Between Delimiters: return_between()
Parsing a Data Set into an Array: parse_array()
Parsing Attribute Values: get_attribute()
Removing Unwanted Text: remove()
Useful PHP Functions
Detecting Whether a String Is Within Another String
Replacing a Portion of a String with Another String
Parsing Unformatted Text
Measuring the Similarity of Strings
Final Thoughts
Don't Trust a Poorly Coded Web Page
Parse in Small Steps
Don't Render Parsed Text While Debugging
Use Regular Expressions Sparingly
5. AUTOMATING FORM SUBMISSION
Reverse Engineering Form Interfaces
Form Handlers, Data Fields, Methods, and Event Triggers
Form Handlers
Data Fields
Methods
Event Triggers
Unpredictable Forms
JavaScript Can Change a Form Just Before Submission
Form HTML Is Often Unreadable by Humans
Cookies Aren't Included in the Form, but Can Affect Operation
Analyzing a Form
Final Thoughts
Don't Blow Your Cover
Correctly Emulate Browsers
Avoid Form Errors
6. MANAGING LARGE AMOUNTS OF DATA
Organizing Data
Naming Conventions
Storing Data in Structured Files
Storing Text in a Database
Storing Images in a Database
Database or File?
Making Data Smaller
Storing References to Image Files
Compressing Data
Removing Formatting
Thumbnailing Images
Final Thoughts
II. PROJECTS
7. PRICE-MONITORING WEBBOTS
The Target
Designing the Parsing Script
Initialization and Downloading the Target
Further Exploration
8. IMAGE-CAPTURING WEBBOTS
Example Image-Capturing Webbot
Creating the Image-Capturing Webbot
Binary-Safe Download Routine
Directory Structure
The Main Script