Webbots, Spiders, and Screen Scrapers - Michael Schrenk
Software
In an effort to be as relevant as possible, the software examples in this book use PHP,[2] cURL,[3] and MySQL.[4] All of these software technologies are available as free downloads from their respective websites. In addition to being free, these software packages are wonderfully portable and function well on a variety of computers and operating systems.
Note
If you're going to follow the script examples in this book, you will need a basic knowledge of PHP. This book assumes you know how to program.
Internet Access
A connection to the Internet is very handy, but not entirely necessary. If you lack a network connection, you can create your own local intranet (one or more webservers on a private network) by loading Apache[5] onto your computer, and if that's not possible, you can design programs that use local files as targets. However, neither of these options is as fun as writing webbots that use a live Internet connection. In addition, if you lack an Internet connection, you will not have access to the online resources, which add a lot of value to your learning experience.
* * *
[2] See http://www.php.net.
[3] See http://curl.haxx.se.
[4] See http://www.mysql.com.
[5] See http://www.apache.org.
A Disclaimer (This Is Important)
As with anything you develop, you must take responsibility for your own actions. From a technology standpoint, there is little to distinguish a beneficial webbot from one that does destructive things. The main difference is the intent of the developer (and how well you debug your scripts). Therefore, it's up to you to do constructive things with the information in this book and not violate copyright law, disrupt networks, or do anything else that would be troublesome or illegal. And if you do, don't call me.
Please reference Chapter 28 for insight into how to write webbots ethically. That chapter will help you do this, but it won't provide legal advice. If you have questions, talk to a lawyer before you experiment.
Part I. FUNDAMENTAL CONCEPTS AND TECHNIQUES
While most web development books explain how to create websites, this book teaches developers how to combine, adapt, and automate existing websites to fit their specific needs. Part I introduces the concept of web automation and explores elementary techniques to harness the resources of the Web.
Chapter 1
This chapter explores why it is fun to write webbots and why webbot development is a rewarding career with expanding possibilities.
Chapter 2
We've been led to believe that the only way to use a website is with a browser. If, however, you examine what you want to do, as opposed to what a browser allows you to do, you'll look at your favorite web resources in a whole new way. This chapter discusses existing as well as potential webbots.
Chapter 3
This chapter introduces PHP/CURL, the free library that makes it easy to download web pages—even when the targeted web pages use advanced techniques like forwarding, encryption, authentication, and cookies.
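As a taste of what this chapter covers, here is a minimal sketch (not taken from the book's library; the target URL is a placeholder) of fetching a page with PHP's cURL extension, with redirect-following and a cookie jar enabled:

```php
<?php
// Fetch a web page with PHP's cURL extension (assumes the extension is installed).
$ch = curl_init("http://www.example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the page as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects ("forwarding")
curl_setopt($ch, CURLOPT_COOKIEJAR,  "cookies.txt");  // save cookies the server sets
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");  // send saved cookies back
$page = curl_exec($ch);                               // returns false on failure
curl_close($ch);
echo strlen($page) . " bytes downloaded\n";
```

PHP/CURL wraps this option-setting boilerplate in a single call; the raw extension shown here is what that library builds on.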
Chapter 4
Downloaded web pages aren't of any use until your webbot can separate the data you need from the data you don't need.
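The core parsing move is simpler than it sounds: find the text between two known delimiters. A simplified sketch (the book's parse library is more complete than this illustrative function):

```php
<?php
// Return the text found between two delimiters, a basic screen-scraping step.
function return_between($string, $start, $stop) {
    $head = strpos($string, $start);          // find the opening delimiter
    if ($head === false) return "";
    $head += strlen($start);                  // move past the delimiter itself
    $tail = strpos($string, $stop, $head);    // find the closing delimiter
    if ($tail === false) return "";
    return substr($string, $head, $tail - $head);
}

$html = '<html><head><title>Widgets: $19.99</title></head></html>';
echo return_between($html, "<title>", "</title>") . "\n";  // Widgets: $19.99
```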
Chapter 5
To truly automate web agents, your application needs the ability to automatically upload data to online forms.
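Uploading form data amounts to sending the same POST request a browser would. A sketch with a hypothetical URL and field names (a real webbot would use the field names found in the target form's HTML):

```php
<?php
// Emulate a form submission with a cURL POST (hypothetical target and fields).
$fields = http_build_query(array("user" => "me", "query" => "webbots"));
$ch = curl_init("http://www.example.com/search");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POST, true);           // send as POST, like a browser
curl_setopt($ch, CURLOPT_POSTFIELDS, $fields);  // "user=me&query=webbots"
$response = curl_exec($ch);
curl_close($ch);
```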
Chapter 6
Spiders in particular can generate huge amounts of data. That's why it's important for you to know how to effectively store and reduce the size of web pages, text, and images.
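Because web pages are full of repetitive markup, they compress well. A sketch of shrinking a page with PHP's zlib extension before storing it (the sample page here is fabricated for illustration):

```php
<?php
// Shrink a fetched page with zlib before storing it (assumes PHP's zlib
// extension, which is normally compiled in).
$page = str_repeat("<tr><td>row of repetitive markup</td></tr>\n", 500);
$compressed = gzcompress($page, 9);           // 9 = maximum compression
printf("%d bytes -> %d bytes\n", strlen($page), strlen($compressed));
assert(gzuncompress($compressed) === $page);  // lossless round trip
```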
You may already have experience from other areas of computer science that you can apply to these activities. However, even if these concepts are familiar to you, developing webbots may force you to view these skills in a different context, so the following chapters are still worth reading. If you don't already have experience in these areas, the next six chapters will provide the basics for designing and developing webbots. You'll