Online Book Reader

Beautiful Code [35]

By Root 7275 0

to the problems of biological data management is called bioinformatics.

Bioinformatics is similar in many ways to software engineering on Wall Street. Like software engineers in the financial sector, bioinformaticians need to be fleet of foot: they have to get applications up and running quickly with little time for requirement analysis and design. Data sets are large and mutable, with shelf lives measured in months, not years. For this reason, most bioinformatics developers favor agile development techniques, such as eXtreme Programming, and toolkits that allow for rapid prototyping and deployment. As in the financial sector, there is also a strong emphasis on data visualization and pattern recognition.

12.1. BioPerl and the Bio::Graphics Module

One of the rapid development toolkits developed by and for bioinformaticians is BioPerl, an extensive open source library of reusable code for the Perl programming language. BioPerl provides modules to handle most common bioinformatics problems involving DNA and protein analysis, the construction and analysis of evolutionary trees, the interpretation of genetic data, and, of course, genome sequence analysis.

BioPerl allows a software engineer to rapidly create complex pipelines to process, filter, analyze, integrate, and visualize large biological datasets. Because of its extensive testing by the open source community, applications built on top of BioPerl are more likely to work right the first time, and because Perl interpreters are available for all major platforms, applications written with BioPerl will run on Microsoft Windows, Mac OS X, Linux, and Unix machines.

This chapter discusses Bio::Graphics, BioPerl's genome map rendering module. The problem it addresses is how to visualize a genome and its annotations. A genome consists of a set of DNA sequences, each a string of the letters A, G, C, and T, which stand for the four nucleotides. Sequence length is measured in base pairs, or bp. Some of the DNA sequence strings can be quite long: for example, the human genome consists of 24 DNA sequence strings, one each for chromosomes 1 through 22, plus the X and Y chromosomes. The longest of these, chromosome 1, is roughly 249,000,000 bp long (about 249 megabases).
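To make the data model concrete, here is a minimal Python sketch of a genome as a set of named DNA strings. It is not BioPerl code: the names `toy_genome`, `is_valid_dna`, and `length_in_bp`, along with the two short sequences, are invented purely for illustration.

```python
# Illustrative sketch only: the sequences and names below are made up,
# and real chromosomes are many millions of letters long.
VALID_BASES = set("AGCT")


def is_valid_dna(sequence: str) -> bool:
    """Return True if the sequence uses only the letters A, G, C, and T."""
    return set(sequence.upper()) <= VALID_BASES


def length_in_bp(sequence: str) -> int:
    """Each letter stands for one base pair, so length is the letter count."""
    return len(sequence)


# A hypothetical two-sequence "genome": name -> DNA string.
toy_genome = {
    "chrA": "AGCTTAGGCTA",
    "chrB": "TTGACA",
}

# Find the name of the longest sequence, as one might for chromosome 1.
longest = max(toy_genome, key=lambda name: length_in_bp(toy_genome[name]))
```

A real genome would be read from files rather than typed in, but the shape of the data is the same: a small number of very long strings.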

Hidden inside these DNA sequences are multiple regions that play roles in cell metabolism, reproduction, defense, and signaling. For example, some sections of the chromosome 1 DNA sequence are protein-coding genes. These genes are "transcribed" by the cell into shorter RNA sequences that are transported from the cell nucleus into the cytoplasm; these RNA sequences are then translated into proteins responsible for generating energy, moving nutrients into and out of the cell, making the cell membrane, and so on. Other regions of the DNA sequence are regulatory in nature: when a regulatory protein binds to a specific regulatory site, a nearby protein-coding gene is "turned on" and starts to be transcribed. Some regions correspond to parasitic DNA: short regions of sequence that can replicate themselves semiautonomously and hitchhike around on the genome. Still other regions are of unknown significance; we can tell that they're important because they have been conserved among humans and other organisms across long evolutionary intervals, but we don't yet understand what they do.

Finding and interpreting functionally significant regions of the genome is called annotation and is now the major focus of the genome project. The annotation of a genome typically generates far more data than the raw DNA sequence itself. The whole human genome sequence occupies just three gigabytes uncompressed, but its current annotation uses many terabytes (also see Chapter 13).
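The three-gigabyte figure follows from simple arithmetic, sketched below in Python. The round genome length of three billion bases and the helper name `storage_bytes` are assumptions made for this illustration. The two-bit case is added only as a comparison, to show that four letters fit in two bits each.

```python
# Assumed round figure for illustration: about three billion bases.
GENOME_LENGTH_BP = 3_000_000_000


def storage_bytes(n_bases: int, bits_per_base: int) -> int:
    """Bytes needed to store n_bases at the given bit width, rounded up."""
    return (n_bases * bits_per_base + 7) // 8


# One text character (8 bits) per base: the "uncompressed" case.
as_text = storage_bytes(GENOME_LENGTH_BP, 8)

# Four possible letters need only 2 bits each.
as_two_bit = storage_bytes(GENOME_LENGTH_BP, 2)

print(f"one byte per base: {as_text / 1e9:.2f} GB")
print(f"two bits per base: {as_two_bit / 1e9:.2f} GB")
```

Either way, the raw sequence is small next to the terabytes of annotation layered on top of it.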

12.1.1. Example of Bio::Graphics Output

To home in on "interesting" regions of the genome, biologists need to visualize how multiple annotations relate to each other. For example, a putative regulatory region is more likely to be functionally significant if it is spatially close to a protein-coding gene and overlaps with a region that is conserved between evolutionarily distant species.
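The reasoning in that example can be written as a pair of interval tests. The Python sketch below is not Bio::Graphics, which draws such relationships rather than computing them. The `Feature` class, the 1,000 bp distance threshold, and all coordinates are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """An annotated region with 1-based, inclusive coordinates."""
    name: str
    start: int
    end: int


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two regions share at least one base."""
    return a.start <= b.end and b.start <= a.end


def gap(a: Feature, b: Feature) -> int:
    """Number of bases between two regions; 0 if they overlap."""
    if overlaps(a, b):
        return 0
    return max(a.start, b.start) - min(a.end, b.end) - 1


def is_promising(site, genes, conserved, max_gap=1000):
    """Near some gene AND overlapping some conserved region.

    max_gap is an arbitrary illustrative threshold, not a biological rule.
    """
    near_gene = any(gap(site, g) <= max_gap for g in genes)
    in_conserved = any(overlaps(site, c) for c in conserved)
    return near_gene and in_conserved


# Hypothetical annotations on one chromosome.
genes = [Feature("geneX", 5600, 9000)]
conserved = [Feature("cons1", 5100, 5400)]
site_near = Feature("site1", 5000, 5200)
site_far = Feature("site2", 50000, 50100)

print(is_promising(site_near, genes, conserved))
print(is_promising(site_far, genes, conserved))
```

Checks like these answer a yes-or-no question about one region. A drawn map lets a biologist see the same relationships across many annotations at once.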

Bio::Graphics allows
