Beautiful Code [185]
The Gene Sorter helps scientists rapidly sift through the roughly 25,000 genes in the human genome to find those most relevant to their research. The program is part of the http://genome.ucsc.edu web site, which also contains many other tools for working with data generated by the Human Genome Project. The Gene Sorter design is simple and flexible. It incorporates many lessons we learned in two previous generations of programs that serve biomedical data over the Web. The program uses CGI to gather input from the user, makes queries into a MySQL database, and presents the results in HTML. About half of the program code resides in libraries shared with other http://genome.ucsc.edu tools.
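The request cycle described above (gather CGI input, query the database, emit HTML) can be sketched in a few lines. This is only an illustration of the flow, not the Gene Sorter's actual code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the `gene` table, its columns, the `chrom` parameter, and the placeholder rows are all hypothetical.

```python
import html
import sqlite3
from urllib.parse import parse_qs


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema and placeholder rows, invented for this sketch.
    conn.execute("CREATE TABLE gene (name TEXT, chrom TEXT, description TEXT)")
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("geneA", "chr1", "placeholder <A>"),
            ("geneB", "chr2", "placeholder B"),
        ],
    )
    return conn


def handle_request(query_string: str, conn: sqlite3.Connection) -> str:
    """Turn a CGI-style query string into an HTML table of matching genes."""
    # Step 1: gather input from the user (a CGI program reads QUERY_STRING).
    params = parse_qs(query_string)
    chrom = params.get("chrom", [""])[0]

    # Step 2: query the database, passing user input as a bound parameter.
    rows = conn.execute(
        "SELECT name, description FROM gene WHERE chrom = ? ORDER BY name",
        (chrom,),
    ).fetchall()

    # Step 3: present the results as HTML, escaping every value.
    body = "".join(
        f"<tr><td>{html.escape(name)}</td><td>{html.escape(desc)}</td></tr>"
        for name, desc in rows
    )
    return f"<table>{body}</table>"


print(handle_request("chrom=chr1", make_demo_db()))
```

Binding the user's input as a query parameter and escaping values on output are general precautions for any program of this shape; the text does not say how the Gene Sorter itself handles them.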
The human genome is a digital code that somehow contains all of the information needed to build a human body, including that most remarkable of organs, the human brain. The information is stored in three billion bases of DNA. Each base can be an A, C, G, or T. Thus, there are two bits of information per base, or 750 megabytes of information in the genome.
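The arithmetic behind the 750-megabyte figure is easy to check. The sketch below assumes decimal megabytes (one million bytes each) and adds a hypothetical `pack` helper to show what two bits per base looks like in practice; the particular bit assignment (A=0, C=1, G=2, T=3) is an arbitrary choice made for this illustration.

```python
import math

BASES_IN_GENOME = 3_000_000_000  # "three billion bases", as stated in the text
ALPHABET = "ACGT"                # four possible values per base

bits_per_base = math.log2(len(ALPHABET))        # log2(4) = 2 bits
total_bits = BASES_IN_GENOME * bits_per_base
total_megabytes = total_bits / 8 / 1_000_000    # 8 bits per byte, 10^6 bytes per MB

# Arbitrary 2-bit code for each base (an assumption of this sketch).
CODE = {base: i for i, base in enumerate(ALPHABET)}


def pack(dna: str) -> bytes:
    """Pack a DNA string four bases per byte, zero-padding the last byte."""
    out = bytearray()
    for i in range(0, len(dna), 4):
        chunk = dna[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


print(total_megabytes)        # 750.0
print(len(pack("ACGTACGT")))  # 2 bytes for 8 bases
```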
It is remarkable that the information to build a human being could fit easily onto a memory stick in your pocket. Even more remarkably, we know from an evolutionary analysis of many genomes that only about 10 percent of that information is actually needed. The other 90 percent of the genome consists primarily of relics from evolutionary experiments that turned into dead ends, and of the clutter left by virus-like elements known as transposons.
Most of the currently functional parts of the genome are found in genes. Genes consist of regulatory elements that determine how much of the gene product will be made, and the code for the gene product itself. The regulation of genes is often quite complex. Different types of cells use different genes. The same cell type uses different genes in different situations.
The gene products are diverse, too. A large and important class of genes produces messenger RNA (mRNA), which is then translated into proteins. These proteins include the receptor molecules that let the cell sense the environment and interact with other cells, the enzymes that help convert food to more usable forms of energy, and the transcription factors that control the activity of other genes. Though it has not been an easy job, science has identified about 90 percent of the genes in the genome, over 20,000 genes in all.
Most scientific research projects are interested in just a few dozen of these genes. People researching a rare genetic disease examine the patterns of inheritance of the disease to link it to perhaps a 10,000,000-base region of a single chromosome. In recent years, scientists have tried to associate 100,000-base regions with more common diseases, such as diabetes, that are partly but not entirely genetic in nature.
13.1. The User Interface of the Gene Sorter
The Gene Sorter can collect all the known genes in disease-related regions of DNA into a list of candidate genes. This list is displayed in a table, illustrated in Figure 13-1, that includes summary information on each gene and hyperlinks to additional information. The candidate list can be filtered to eliminate genes that are obviously not relevant, such as genes expressed only in the kidneys when the viewer is researching a genetic disease of the lungs. The Gene Sorter is also useful in other contexts where one wants to look at more than one gene at once, such as when one is studying genes that are expressed in similar ways or genes that have similar known functions. The Gene Sorter is currently available for the human, mouse, rat, fruit fly, and C. elegans genomes.
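The two steps just described, collecting the genes in a region and then filtering out irrelevant ones, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Gene Sorter's implementation: the `Gene` record, the `candidates` and `keep_expressed_in` functions, and every gene name, coordinate, and tissue label are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    """Hypothetical gene record; the fields are assumptions of this sketch."""
    name: str
    chrom: str
    start: int
    end: int
    expressed_in: set = field(default_factory=set)


def candidates(genes, chrom, region_start, region_end):
    """Collect the genes that overlap the half-open region [region_start, region_end)."""
    return [
        g for g in genes
        if g.chrom == chrom and g.start < region_end and g.end > region_start
    ]


def keep_expressed_in(genes, tissue):
    """Drop genes with no recorded expression in the tissue of interest."""
    return [g for g in genes if tissue in g.expressed_in]


# Made-up genes and coordinates, for illustration only.
GENES = [
    Gene("geneA", "chr7", 1_000_000, 1_050_000, {"lung", "kidney"}),
    Gene("geneB", "chr7", 4_000_000, 4_020_000, {"kidney"}),
    Gene("geneC", "chr7", 20_000_000, 20_030_000, {"lung"}),
    Gene("geneD", "chr2", 2_000_000, 2_010_000, {"lung"}),
]

# A 10,000,000-base region linked to a lung disease.
in_region = candidates(GENES, "chr7", 0, 10_000_000)
relevant = keep_expressed_in(in_region, "lung")
print([g.name for g in relevant])  # ['geneA']
```

Here geneB is dropped because it is expressed only in the kidney, while geneC and geneD never reach the filter because they fall outside the region.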
The controls on the top of the screen specify which version of which genome to use. The table underneath contains one row per gene.
Figure 13-1. Main page of the Gene Sorter
A separate configuration page controls which columns are displayed in the table and how they are displayed. A filter page can be used to filter out genes based