Webbots, Spiders, and Screen Scrapers - Michael Schrenk

Table of Contents

ACKNOWLEDGMENTS

Introduction

Old-School Client-Server Technology

The Problem with Browsers

What to Expect from This Book

Learn from My Mistakes

Master Webbot Techniques

Leverage Existing Scripts

About the Website

About the Code

Requirements

Hardware

Software

Internet Access

A Disclaimer (This Is Important)

I. FUNDAMENTAL CONCEPTS AND TECHNIQUES

1. WHAT'S IN IT FOR YOU?

Uncovering the Internet's True Potential

What's in It for Developers?

Webbot Developers Are in Demand

Webbots Are Fun to Write

Webbots Facilitate "Constructive Hacking"

What's in It for Business Leaders?

Customize the Internet for Your Business

Capitalize on the Public's Inexperience with Webbots

Accomplish a Lot with a Small Investment

Final Thoughts

2. IDEAS FOR WEBBOT PROJECTS

Inspiration from Browser Limitations

Webbots That Aggregate and Filter Information for Relevance

Webbots That Interpret What They Find Online

Webbots That Act on Your Behalf

A Few Crazy Ideas to Get You Started

Help Out a Busy Executive

Save Money by Automating Tasks

Protect Intellectual Property

Monitor Opportunities

Verify Access Rights on a Website

Create an Online Clipping Service

Plot Unauthorized Wi-Fi Networks

Track Web Technologies

Allow Incompatible Systems to Communicate

Final Thoughts

3. DOWNLOADING WEB PAGES

Think About Files, Not Web Pages

Downloading Files with PHP's Built-in Functions

Downloading Files with fopen() and fgets()

Downloading Files with file()

Introducing PHP/CURL

Multiple Transfer Protocols

Form Submission

Basic Authentication

Cookies

Redirection

Agent Name Spoofing

Referer Management

Socket Management

Installing PHP/CURL

LIB_http

Familiarizing Yourself with the Default Values

Using LIB_http

Learning More About HTTP Headers

Examining LIB_http's Source Code

Final Thoughts

4. PARSING TECHNIQUES

Parsing Poorly Written HTML

Standard Parse Routines

Using LIB_parse

Splitting a String at a Delimiter: split_string()

Parsing Text Between Delimiters: return_between()

Parsing a Data Set into an Array: parse_array()

Parsing Attribute Values: get_attribute()

Removing Unwanted Text: remove()

Useful PHP Functions

Detecting Whether a String Is Within Another String

Replacing a Portion of a String with Another String

Parsing Unformatted Text

Measuring the Similarity of Strings

Final Thoughts

Don't Trust a Poorly Coded Web Page

Parse in Small Steps

Don't Render Parsed Text While Debugging

Use Regular Expressions Sparingly

5. AUTOMATING FORM SUBMISSION

Reverse Engineering Form Interfaces

Form Handlers, Data Fields, Methods, and Event Triggers

Form Handlers

Data Fields

Methods

Event Triggers

Unpredictable Forms

JavaScript Can Change a Form Just Before Submission

Form HTML Is Often Unreadable by Humans

Cookies Aren't Included in the Form, but Can Affect Operation

Analyzing a Form

Final Thoughts

Don't Blow Your Cover

Correctly Emulate Browsers

Avoid Form Errors

6. MANAGING LARGE AMOUNTS OF DATA

Organizing Data

Naming Conventions

Storing Data in Structured Files

Storing Text in a Database

Storing Images in a Database

Database or File?

Making Data Smaller

Storing References to Image Files

Compressing Data

Removing Formatting

Thumbnailing Images

Final Thoughts

II. PROJECTS

7. PRICE-MONITORING WEBBOTS

The Target

Designing the Parsing Script

Initialization and Downloading the Target

Further Exploration

8. IMAGE-CAPTURING WEBBOTS

Example Image-Capturing Webbot

Creating the Image-Capturing Webbot

Binary-Safe Download Routine

Directory Structure

The Main Script
