Webbots, Spiders, and Screen Scrapers - Michael Schrenk
Strengthening Authentication by Combining Techniques
Authentication and Webbots
Example Scripts and Practice Pages
Basic Authentication
Session Authentication
Authentication with Cookie Sessions
Authentication with Query Sessions
Final Thoughts
22. ADVANCED COOKIE MANAGEMENT
How Cookies Work
PHP/CURL and Cookies
How Cookies Challenge Webbot Design
Purging Temporary Cookies
Managing Multiple Users' Cookies
Further Exploration
23. SCHEDULING WEBBOTS AND SPIDERS
The Windows Task Scheduler
Preparing Your Webbots to Run as Scheduled Tasks
Scheduling a Webbot to Run Daily
Complex Schedules
Non-Calendar-Based Triggers
Final Thoughts
Determine the Webbot's Best Periodicity
Avoid Single Points of Failure
Add Variety to Your Schedule
IV. LARGER CONSIDERATIONS
24. DESIGNING STEALTHY WEBBOTS AND SPIDERS
Why Design a Stealthy Webbot?
Log Files
Log-Monitoring Software
Stealth Means Simulating Human Patterns
Be Kind to Your Resources
Run Your Webbot During Busy Hours
Don't Run Your Webbot at the Same Time Each Day
Don't Run Your Webbot on Holidays and Weekends
Use Random, Intra-fetch Delays
Final Thoughts
25. WRITING FAULT-TOLERANT WEBBOTS
Types of Webbot Fault Tolerance
Adapting to Changes in URLs
Adapting to Changes in Page Content
Adapting to Changes in Forms
Adapting to Changes in Cookie Management
Adapting to Network Outages and Network Congestion
Error Handlers
26. DESIGNING WEBBOT-FRIENDLY WEBSITES
Optimizing Web Pages for Search Engine Spiders
Well-Defined Links
Google Bombs and Spam Indexing
Title Tags
Meta Tags
Header Tags
Image alt Attributes
Web Design Techniques That Hinder Search Engine Spiders
JavaScript
Non-ASCII Content
Designing Data-Only Interfaces
XML
Lightweight Data Exchange
SOAP
27. KILLING SPIDERS
Asking Nicely
Create a Terms of Service Agreement
Use the robots.txt File
Use the Robots Meta Tag
Building Speed Bumps
Selectively Allow Access to Specific Web Agents
Use Obfuscation
Use Cookies, Encryption, JavaScript, and Redirection
Authenticate Users
Update Your Site Often
Embed Text in Other Media
Setting Traps
Create a Spider Trap
Fun Things to Do with Unwanted Spiders
Final Thoughts
28. KEEPING WEBBOTS OUT OF TROUBLE
It's All About Respect
Copyright
Do Consult Resources
Don't Be an Armchair Lawyer
Trespass to Chattels
Internet Law
Final Thoughts
A. PHP/CURL REFERENCE
Creating a Minimal PHP/CURL Session
Initiating PHP/CURL Sessions
Setting PHP/CURL Options
CURLOPT_URL
CURLOPT_RETURNTRANSFER
CURLOPT_REFERER
CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS
CURLOPT_USERAGENT
CURLOPT_NOBODY and CURLOPT_HEADER
CURLOPT_TIMEOUT
CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR
CURLOPT_HTTPHEADER
CURLOPT_SSL_VERIFYPEER
CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH
CURLOPT_POST and CURLOPT_POSTFIELDS
CURLOPT_VERBOSE
CURLOPT_PORT
Executing the PHP/CURL Command
Retrieving PHP/CURL Session Information
Viewing PHP/CURL Errors
Closing PHP/CURL Sessions
B. STATUS CODES
HTTP Codes
NNTP Codes
C. SMS EMAIL ADDRESSES
Webbots, Spiders, and Screen Scrapers
Michael Schrenk
Editor
William Pollock
Copyright © 2009
No Starch Press
* * *
Dedication
In loving memory
Charlotte Schrenk
1897–1982
WEBBOTS, SPIDERS, AND SCREEN SCRAPERS. Copyright © 2007 by Michael Schrenk.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.
Printed on recycled paper in the United States of America