Webbots, Spiders, and Screen Scrapers - Michael Schrenk

Strengthening Authentication by Combining Techniques

Authentication and Webbots

Example Scripts and Practice Pages

Basic Authentication

Session Authentication

Authentication with Cookie Sessions

Authentication with Query Sessions

Final Thoughts

22. ADVANCED COOKIE MANAGEMENT

How Cookies Work

PHP/CURL and Cookies

How Cookies Challenge Webbot Design

Purging Temporary Cookies

Managing Multiple Users' Cookies

Further Exploration

23. SCHEDULING WEBBOTS AND SPIDERS

The Windows Task Scheduler

Preparing Your Webbots to Run as Scheduled Tasks

Scheduling a Webbot to Run Daily

Complex Schedules

Non-Calendar-Based Triggers

Final Thoughts

Determine the Webbot's Best Periodicity

Avoid Single Points of Failure

Add Variety to Your Schedule

IV. LARGER CONSIDERATIONS

24. DESIGNING STEALTHY WEBBOTS AND SPIDERS

Why Design a Stealthy Webbot?

Log Files

Log-Monitoring Software

Stealth Means Simulating Human Patterns

Be Kind to Your Resources

Run Your Webbot During Busy Hours

Don't Run Your Webbot at the Same Time Each Day

Don't Run Your Webbot on Holidays and Weekends

Use Random, Intra-fetch Delays

Final Thoughts

25. WRITING FAULT-TOLERANT WEBBOTS

Types of Webbot Fault Tolerance

Adapting to Changes in URLs

Adapting to Changes in Page Content

Adapting to Changes in Forms

Adapting to Changes in Cookie Management

Adapting to Network Outages and Network Congestion

Error Handlers

26. DESIGNING WEBBOT-FRIENDLY WEBSITES

Optimizing Web Pages for Search Engine Spiders

Well-Defined Links

Google Bombs and Spam Indexing

Title Tags

Meta Tags

Header Tags

Image alt Attributes

Web Design Techniques That Hinder Search Engine Spiders

JavaScript

Non-ASCII Content

Designing Data-Only Interfaces

XML

Lightweight Data Exchange

SOAP

27. KILLING SPIDERS

Asking Nicely

Create a Terms of Service Agreement

Use the robots.txt File

Use the Robots Meta Tag

Building Speed Bumps

Selectively Allow Access to Specific Web Agents

Use Obfuscation

Use Cookies, Encryption, JavaScript, and Redirection

Authenticate Users

Update Your Site Often

Embed Text in Other Media

Setting Traps

Create a Spider Trap

Fun Things to Do with Unwanted Spiders

Final Thoughts

28. KEEPING WEBBOTS OUT OF TROUBLE

It's All About Respect

Copyright

Do Consult Resources

Don't Be an Armchair Lawyer

Trespass to Chattels

Internet Law

Final Thoughts

A. PHP/CURL REFERENCE

Creating a Minimal PHP/CURL Session

Initiating PHP/CURL Sessions

Setting PHP/CURL Options

CURLOPT_URL

CURLOPT_RETURNTRANSFER

CURLOPT_REFERER

CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS

CURLOPT_USERAGENT

CURLOPT_NOBODY and CURLOPT_HEADER

CURLOPT_TIMEOUT

CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR

CURLOPT_HTTPHEADER

CURLOPT_SSL_VERIFYPEER

CURLOPT_USERPWD and CURLOPT_UNRESTRICTED_AUTH

CURLOPT_POST and CURLOPT_POSTFIELDS

CURLOPT_VERBOSE

CURLOPT_PORT

Executing the PHP/CURL Command

Retrieving PHP/CURL Session Information

Viewing PHP/CURL Errors

Closing PHP/CURL Sessions

B. STATUS CODES

HTTP Codes

NNTP Codes

C. SMS EMAIL ADDRESSES

Webbots, Spiders, and Screen Scrapers


Michael Schrenk


Editor

William Pollock

Copyright © 2009

No Starch Press

* * *

Dedication

In loving memory

Charlotte Schrenk

1897–1982

WEBBOTS, SPIDERS, AND SCREEN SCRAPERS. Copyright © 2007 by Michael Schrenk.

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.

Printed on recycled paper in the United States of America
