Webbots, Spiders, and Screen Scrapers - Michael Schrenk [31]
If you give members of a certain group labels that are all the same part of speech, don't occasionally throw in a label with another grammatical form. For example, if you have a group of directories named with nouns, don't name another directory in the same group with a verb. If you do, chances are it doesn't belong in that group of things in the first place.
If you are naming files in a directory, you may want to give the files names that will later facilitate easy grouping or sorting. For example, if you are using a filename that defines a date, filenames with the format year_month_day will make more sense when sorted than filenames with the format month_day_year. This is because year, month, and day form a sequential progression from largest to smallest, so an alphabetical sort of the filenames will accurately reflect chronological order.
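The effect of the two date formats on sort order can be checked with a short sketch. The example below is in Python, chosen only for illustration, and the filenames are hypothetical; it sorts the same three dates in both formats the way an alphabetical directory listing would.

```python
# Hypothetical filenames for the same three dates, written in two formats.
year_first = ["2006_02_03.txt", "2005_12_25.txt", "2006_01_15.txt"]
month_first = ["02_03_2006.txt", "12_25_2005.txt", "01_15_2006.txt"]

# A plain alphabetical sort, similar to what a directory listing does.
sorted_year_first = sorted(year_first)
sorted_month_first = sorted(month_first)

print(sorted_year_first)   # oldest file first: the order is chronological
print(sorted_month_first)  # the 2005 file lands last: not chronological
```

Note that the year-first format sorts correctly only when months and days are zero-padded to two digits (02 rather than 2); otherwise 10 would sort before 2.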
Storing Data in Structured Files
To successfully store files in a structured series of directories, you need to find out what the files have in common. In most cases, the problem you're trying to solve and the means for retrieving the data will dictate the common factors among your files. Figuratively, you need to look for the lowest common denominator for all your files. Figure 6-1 shows a file structure for storing data retrieved by a webbot that runs once a day. Its common theme is time.
Figure 6-1. Example of a structured filesystem primarily based on dates
With the structure defined in Figure 6-1, you could easily locate thumbnail images created by the webbot on February 3, 2006, because the folders comply with the following specification:
drive:\project\year\month\day\category\subcategory\files
Therefore, the path would look like this:
c:\Spider_files\2006\02\03\Graphics\Thumbnails\
People can easily decipher this structure, and so can programs that need to determine the correct file path programmatically. Figure 6-2 shows another file structure, primarily based on geography.
Figure 6-2. A geographically themed example of a structured filesystem
Ensure that all files have a unique path and that either a person or a computer can easily make sense of these paths.
File structures, like the ones shown in the previous figures, are commonly created by webbots. You'll see how to write webbots that create file structures in Chapter 8.
Storing Text in a Database
While many applications call for file structures similar to the ones shown in Figure 6-1 or Figure 6-2, the majority of projects you're likely to encounter will require that data be stored in a database. A database has many advantages over a file structure. The primary advantage is the ability to query the database with a language called Structured Query Language, or SQL (pronounced SEE-quill). SQL allows programs to sort, extract, update, combine, insert, and manipulate data in nearly any imaginable way.
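To give a sense of what inserting, updating, extracting, and sorting look like in SQL, here is a minimal sketch. It is an illustration under stated assumptions, not the method used in this book: it is written in Python and uses Python's built-in sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, and the table name, column names, and URLs are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table describing pages a webbot has downloaded.
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, size_bytes INTEGER)")

# Insert: store three rows of example data.
rows = [
    ("http://www.example.com/a", "Page A", 5120),
    ("http://www.example.com/b", "Page B", 1024),
    ("http://www.example.com/c", "Page C", 20480),
]
conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)

# Update: change the stored title of one page.
conn.execute(
    "UPDATE pages SET title = ? WHERE url = ?",
    ("Page B (revised)", "http://www.example.com/b"),
)

# Extract and sort: pages larger than 2,000 bytes, largest first.
large_pages = conn.execute(
    "SELECT title, size_bytes FROM pages "
    "WHERE size_bytes > ? ORDER BY size_bytes DESC",
    (2000,),
).fetchall()

# Extract a single value to confirm the update took effect.
revised_title = conn.execute(
    "SELECT title FROM pages WHERE url = ?",
    ("http://www.example.com/b",),
).fetchone()[0]

print(large_pages)
print(revised_title)
conn.close()
```

Doing the same filtering and sorting with a file structure would mean opening and inspecting every file; with a database, a single query does the work.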
It is not within the scope of this book to teach SQL, but this book does include the LIB_mysql library, which simplifies using SQL with the open source database called MySQL[21] (pronounced my-ess-kew-el).
LIB_mysql
LIB_mysql consists of a few server configurations and three functions, which should handle most of your database needs. These functions act as abstractions or simplifications of the actual interface to the program. Abstractions are important because they allow access to a wide variety of database functions with a common interface and error-reporting method. They also allow you to use a database other than MySQL by creating a similar library for a new database. For example, if you choose to use another database someday, you could