Online Book Reader


Beautiful Code [169]

from the cell nucleus into the cytoplasm; these RNA sequences are then translated into proteins responsible for generating energy, moving nutrients into and out of the cell, making the cell membrane, and so on. Other regions of the DNA sequence are regulatory in nature: when a regulatory protein binds to a specific regulatory site, a nearby protein-coding gene is "turned on" and starts to be transcribed. Some regions correspond to parasitic DNA: short regions of sequence that can replicate themselves semiautonomously and hitchhike around on the genome. Still other regions are of unknown significance; we can tell that they're important because they have been conserved among humans and other organisms across long evolutionary intervals, but we don't yet understand what they do.

Finding and interpreting functionally significant regions of the genome is called annotation and is now the major focus of the genome project. The annotation of a genome typically generates far more data than the raw DNA sequence itself. The whole human genome sequence occupies just three gigabytes uncompressed, but its current annotation uses many terabytes (also see Chapter 13).

12.1.1. Example of Bio::Graphics Output

To home in on "interesting" regions of the genome, biologists need to visualize how multiple annotations relate to each other. For example, a putative regulatory region is more likely to be functionally significant if it is spatially close to a protein-coding gene and overlaps with a region that is conserved between evolutionarily distant species.
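The reasoning in the preceding paragraph can be sketched as a pair of interval tests. This is a minimal illustration in Python, not part of Bio::Graphics: the `Feature` type, the function names, and the `max_gap` threshold are all hypothetical, and coordinates are assumed to be inclusive base positions on a single chromosome.

```python
from typing import NamedTuple, Sequence


class Feature(NamedTuple):
    """A hypothetical annotation: a named, inclusive range of base positions."""
    name: str
    start: int
    end: int


def overlaps(a: Feature, b: Feature) -> bool:
    """True if the two features share at least one base."""
    return a.start <= b.end and b.start <= a.end


def separation(a: Feature, b: Feature) -> int:
    """Number of bases lying strictly between two features (0 if they overlap or touch)."""
    if overlaps(a, b):
        return 0
    return max(a.start, b.start) - min(a.end, b.end) - 1


def is_promising(region: Feature,
                 genes: Sequence[Feature],
                 conserved: Sequence[Feature],
                 max_gap: int = 1000) -> bool:
    """Apply the heuristic from the text: near a gene AND overlapping a conserved region.

    max_gap is an illustrative threshold, not a biologically derived value.
    """
    near_gene = any(separation(region, g) <= max_gap for g in genes)
    in_conserved = any(overlaps(region, c) for c in conserved)
    return near_gene and in_conserved
```

A visualization serves the same purpose for a human reader: stacking the gene, conservation, and regulatory annotations on a common scale lets the eye perform these overlap and proximity tests at a glance.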

Bio::Graphics allows bioinformatics software developers to rapidly visualize a genome and all its annotations. It can be used in a standalone fashion to generate a static image of a region in a variety of graphics formats (including PNG, JPEG, and SVG), or incorporated into a web or desktop application to provide interactive scrolling, zooming, and data exploration.

Figure 12-1 gives an example of an image generated by Bio::Graphics. It shows a region of the genome of C. elegans (a small soil-dwelling worm) and illustrates several aspects of a typical Bio::Graphics image. The image is divided vertically into a series of horizontal tracks. The top track consists of a scale that runs horizontally from left to right. The units are kilobases ("k"), indicating thousands of DNA bases. The region shown begins just before position 160,000 of C. elegans chromosome I and extends to just after position 179,000, covering roughly 20,000 base pairs in all. Below the scale are four annotation tracks, which illustrate increasingly complex visualizations.

Figure 12-1. A sample image generated by Bio::Graphics

The original image is brightly colored, but has been reduced to grayscale here for printing. The simplest track is "cDNA for RNAi," which shows the positions of a type of experimental reagent that the research community has created for studying the regulation of C. elegans genes. The track contains a single annotation, on the right, named yk247c7. It is drawn as a black rectangle that begins at roughly position 173,500 and extends to roughly 176,000. It corresponds to a physical piece of DNA covering this region, which a researcher can order from a biotech supply company and use experimentally to change the activity of the gene that overlaps it (in this case, F56C11.6).

The "WABA alignments" track shows slightly more complex information. It visualizes quantitative data arising from comparing this part of the C. elegans genome to similar regions in a different worm. Regions that are highly similar are dark gray. Regions that are weakly similar are light gray. Regions of intermediate similarity are medium gray.
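The mapping from similarity to shade described above might be sketched as follows. This Python fragment is only an illustration: the score range of 0.0 to 1.0 and the two cutoffs are assumptions made for the example, not values used by WABA or by Bio::Graphics.

```python
def similarity_to_gray(score: float,
                       weak_max: float = 0.33,
                       strong_min: float = 0.66) -> str:
    """Map a similarity score in [0, 1] to one of three gray shades.

    The cutoffs weak_max and strong_min are hypothetical, chosen only
    to split the range into three bands.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0.0 and 1.0")
    if score >= strong_min:
        return "dark gray"    # highly similar
    if score > weak_max:
        return "medium gray"  # intermediate similarity
    return "light gray"       # weakly similar
```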

The "DNA/GC Content" track shows continuously variable quantitative information. It records the ratio of G and C nucleotides to A and T nucleotides across a sliding window of the nucleotide sequence. This ratio correlates roughly with the likelihood that the corresponding region of the genome contains a protein-coding gene.
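A sliding-window calculation of this kind can be sketched in a few lines of Python. The function name, window size, and step below are illustrative assumptions. The sketch also reports the fraction of each window made up of G and C (the common convention for GC content) rather than the literal G+C to A+T ratio; the two rise and fall together, but the fraction stays defined when a window contains no A or T.

```python
def gc_content_windows(sequence: str, window: int = 4, step: int = 1):
    """Return a list of (window_start, gc_fraction) pairs.

    Each window covers `window` consecutive bases; successive windows
    begin `step` bases apart. Only complete windows are reported.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results


# Three non-overlapping-by-half windows over a short toy sequence.
profile = gc_content_windows("GGCCAATT", window=4, step=2)
# profile == [(0, 1.0), (2, 0.5), (4, 0.0)]
```

Plotting the second element of each pair against the first gives the kind of continuous curve the track displays; a real genome browser would use a much larger window than this toy value.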

The "Genes" track contains the most complex
