Webbots, Spiders, and Screen Scrapers - Michael Schrenk
Old-School Client-Server Technology
My big moment of discovery came when I learned that I didn't need a browser to view web pages. I realized that Telnet, a program used since the early '80s to communicate with networked computers, could also download web pages, as shown in Figure 1.
Figure 1. Viewing a web page with Telnet
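What Telnet does in that figure can be sketched in a few lines of code: open a TCP connection to a web server and type an HTTP request by hand. The sketch below (written in Python for brevity; the hostname in the usage comment is illustrative) separates composing the request from sending it:

```python
import socket

def build_request(host, path="/"):
    """Compose the same bare HTTP request you would type into Telnet."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )

def fetch(host, path="/", port=80):
    """Open a raw TCP connection to a web server, send the request,
    and return everything the server sends back (headers plus HTML)."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path).encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1", errors="replace")

# fetch("example.com")  # returns the status line, headers, and page HTML
```

There is no browser anywhere in that exchange, which is exactly the point: a web server answers any client that speaks HTTP, whether it's Telnet, a browser, or a script.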
Suddenly, the World Wide Web was something I could understand without a browser. It was a familiar client-server architecture where simple clients worked on tasks found on remote servers. The difference here was that the clients were browsers and the servers dished up web pages.
The only revolutionary thing was that, unlike previous client-server applications, browsers were easy for anyone to use, and they soon gained mass acceptance. The Internet's audience shifted from physicists and computer programmers to the general public. Unfortunately, the general public didn't understand client-server technology or realize that there were other ways to use the World Wide Web, so the dependency on browsers spread unchecked.
As a programmer, I realized that if I could use Telnet to download web pages, I could also write programs to do the same. I could write my own browser if I desired, or I could write automated agents (webbots, spiders, and screen scrapers) to solve problems that browsers couldn't.
* * *
[1] I stumbled across a fan site for The Brady Bunch during my first World Wide Web experience.
The Problem with Browsers
The basic problem with browsers is that they're manual tools. Your browser only downloads and renders websites: You still need to decide if the web page is relevant, if you've already seen the information it contains, or if you need to follow a link to another web page. What's worse, your browser can't think for itself. It can't notify you when something important happens online, and it certainly won't anticipate your actions, automatically complete forms, make purchases, or download files for you. To do these things, you'll need the automation and intelligence only available with a webbot, or a web robot.
What to Expect from This Book
This book identifies the limitations of typical web browsers and explores how you can use webbots to capitalize on these limitations. You'll learn how to design and write webbots through sample scripts and example projects. Moreover, you'll find answers to larger design questions like these:
Where do ideas for webbot projects come from?
How can I have fun with webbots and stay out of trouble?
Is it possible to write stealthy webbots that run without detection?
What is the trick to writing robust, fault-tolerant webbots that won't break as Internet content changes?
Learn from My Mistakes
I've written webbots, spiders, and screen scrapers for nearly 10 years, and in the process I've made most of the mistakes someone can make. Because webbots are capable of making unconventional demands on websites, system administrators can confuse webbots' requests with attempts to hack into their systems. Thankfully, none of my mistakes has ever led to a courtroom, but they have resulted in intimidating phone calls, scary emails, and very awkward moments. Happily, I can say that I've learned from these situations, and it's been a very long time since I've been across the desk from an angry system administrator. You can spare yourself a lot of grief by reading my stories and learning from my mistakes.
Master Webbot Techniques
You will learn about the technology needed to write a wide assortment of webbots. Some technical skills you'll master include these:
Programmatically downloading websites
Decoding encrypted websites
Unlocking authenticated web pages
Managing cookies
Parsing data
Writing spiders
Managing the large amounts of data that webbots generate
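As a taste of the first skill on that list, downloading a page programmatically can take only a few lines. This is a minimal stand-in using just Python's standard library (the URL and User-Agent string below are placeholder assumptions, not values from the book); it sends an identifying User-Agent header, a basic courtesy for a well-behaved webbot:

```python
from urllib.request import Request, urlopen

def make_request(url, agent="example-webbot/0.1"):
    """Build a request that identifies the webbot to the server."""
    return Request(url, headers={"User-Agent": agent})

def download(url, agent="example-webbot/0.1"):
    """Fetch a URL and return the response body as text."""
    with urlopen(make_request(url, agent), timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")

# html = download("http://example.com")  # requires network access
```

Once the page is in a string instead of a browser window, the remaining skills on the list (parsing, spidering, managing data) become ordinary programming problems.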
Leverage Existing Scripts
This book uses several code libraries that make it easy for you to write webbots, spiders, and