Beautiful Code [168]
Laws and international human rights treaties can go only so far in protecting individual rights. History has shown that these are far too easy to bypass or ignore. On the other hand, the mathematics of cryptosystems can, when implemented carefully, provide practically impenetrable shields for individual rights and open expression, and can finally set people across the world free to communicate and trade in privacy and liberty.
Actualizing this global, mathematically protected open society is largely up to us, the gods of the machine.
Growing Beautiful Code in BioPerl > BioPerl and the Bio::Graphics Module
12. Growing Beautiful Code in BioPerl
Lincoln Stein
In the past decade, biology has blossomed into an information science. New technologies provide biologists with unprecedented windows into the intricate processes going on inside the cells of animals and plants. DNA sequencing machines allow the rapid readout of complete genome sequences; microarray technologies give snapshots of the complex patterns of gene expression in developing organisms; confocal microscopes produce 3-D movies to track changes in cell architecture as precancerous tissues turn malignant.
These new technologies routinely generate terabytes of data, all of which must be filtered, stored, manipulated and data-mined. The application of computer science and software engineering to the problems of biological data management is called bioinformatics.
Bioinformatics is similar in many ways to software engineering on Wall Street. Like software engineers in the financial sector, bioinformaticians need to be fleet of foot: they have to get applications up and running quickly with little time for requirement analysis and design. Data sets are large and mutable, with shelf lives measured in months, not years. For this reason, most bioinformatics developers favor agile development techniques, such as eXtreme Programming, and toolkits that allow for rapid prototyping and deployment. As in the financial sector, there is also a strong emphasis on data visualization and pattern recognition.
12.1. BioPerl and the Bio::Graphics Module
One of the rapid development toolkits developed by and for bioinformaticians is BioPerl, an extensive open source library of reusable code for the Perl programming language. BioPerl provides modules to handle most common bioinformatics problems involving DNA and protein analysis, the construction and analysis of evolutionary trees, the interpretation of genetic data, and, of course, genome sequence analysis.
BioPerl allows a software engineer to rapidly create complex pipelines to process, filter, analyze, integrate, and visualize large biological datasets. Because of its extensive testing by the open source community, applications built on top of BioPerl are more likely to work right the first time, and because Perl interpreters are available for all major platforms, applications written with BioPerl will run on Microsoft Windows, Mac OS X, Linux, and Unix machines.
This chapter discusses Bio::Graphics, BioPerl's genome map rendering module. The problem it addresses is how to visualize a genome and its annotations. A genome consists of a set of DNA sequences, each a string of the letters [A,G,C,T], which are nucleotides, also known as base pairs, or bp. Some of the DNA sequence strings can be quite long: for example, the human genome consists of 24 DNA sequence strings, one each for chromosomes 1 through 22, plus the X and Y chromosomes. The longest of these, chromosome 1, is roughly 150,000,000 bp long (150 megabases).
Hidden inside these DNA sequences are multiple regions that play roles in cell metabolism, reproduction, defense, and signaling. For example, some sections of the chromosome 1 DNA sequence are protein-coding genes. These genes are "transcribed" by the cell into shorter RNA sequences that are transported