Beautiful Code [188]
char *invQueryOne; /* Query that returns key, given value. */
};
The structure starts with data shared by all types of columns. Next come the polymorphic methods. Finally, there's a section containing type-specific data.
Each column object contains space for the data of all types of columns. It would be possible, using a union or some related mechanism, to avoid this waste of space. However, this would complicate the use of the type-specific fields, and because there are fewer than 100 columns, the total space saved would be no more than a few kilobytes.
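To make the space trade-off concrete, here is a minimal sketch that compares the two layouts. The structures and field groupings below are hypothetical stand-ins, far smaller than the real column structure, and are meant only to show where the savings would come from.

```c
#include <stddef.h>

/* Hypothetical type-specific data for two kinds of column. */
struct lookupData { char *table; char *keyField; char *valField; };
struct assocData  { char *queryFull; char *queryOne; char *invQueryOne; };

/* Flat layout: every column carries the fields of every type. */
struct flatColumn {
    char *name;
    struct lookupData lookup;
    struct assocData assoc;
};

/* Union layout: the type-specific sections share the same storage. */
struct unionColumn {
    char *name;
    union {
        struct lookupData lookup;
        struct assocData assoc;
    } u;
};

/* Bytes saved across columnCount columns by using the union layout. */
size_t bytesSaved(size_t columnCount)
{
    return columnCount * (sizeof(struct flatColumn) - sizeof(struct unionColumn));
}
```

In this sketch the union saves three pointers per column, which on a platform with 8-byte pointers is 24 bytes, or about 2.4 KB for 100 columns. The price is that every access changes from col->lookup.table to col->u.lookup.table, and nothing stops code from reading the wrong member of the union.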
Most of the functionality of the program resides in the column methods. A column knows how to retrieve data for a particular gene either as a string or as HTML. A column can search for genes where the column data fits a simple search string. The columns also implement the interactive controls to filter data, and the routine to do the filtering itself.
The columns are created by a factory routine based on information in the columnDb.ra files. An excerpt of one of these files is shown in Example 13-2. All columnDb records contain fields describing the column name, user-visible short and long labels, the default location of the column in the table (priority), whether the column is visible by default, and a type field. The type field controls what methods the column has. There may be additional fields, some of which are type-specific. In many cases, the SQL used to query the tables in the database associated with a column is included in the columnDb record, as well as a URL to hyperlink to each item in the column.
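The way a type field can select a column's methods might be sketched as follows. This is an illustration under simplifying assumptions, not the program's actual factory: the column structure is reduced to a single method, and the names columnSetType, describeLookup, and describeAssociation are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Simplified, hypothetical column with a single polymorphic method. */
struct column {
    const char *name;
    const char *(*describe)(struct column *col);  /* stand-in for the real methods */
};

static const char *describeLookup(struct column *col)
{
    (void)col;
    return "lookup";
}

static const char *describeAssociation(struct column *col)
{
    (void)col;
    return "association";
}

/* Install methods according to the first word of the type field.
 * The remaining words are type-specific parameters, ignored here.
 * Returns 0 on success, -1 if the type is missing or not recognized. */
int columnSetType(struct column *col, const char *typeField)
{
    char word[64];
    if (sscanf(typeField, "%63s", word) != 1)
        return -1;
    if (strcmp(word, "lookup") == 0)
        col->describe = describeLookup;
    else if (strcmp(word, "association") == 0)
        col->describe = describeAssociation;
    else
        return -1;
    return 0;
}
```

Given a type field such as "lookup kgXref kgID spID", columnSetType installs the lookup methods, after which callers can invoke col->describe(col) without knowing which kind of column they hold.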
Example 13-2. A section of a columnDb.ra file containing metadata on the columns
name proteinName
shortLabel UniProt
longLabel UniProt (SwissProt/TrEMBL) Protein Display ID
priority 2.1
visibility off
type association kgXref
queryFull select kgID,spDisplayID from kgXref
queryOne select spDisplayId,spID from kgXref where kgID = '%s'
invQueryOne select kgID from kgXref where spDisplayId = '%s'
search fuzzy
itemUrl http://us.expasy.org/cgi-bin/niceprot.pl?%s

name proteinAcc
shortLabel UniProt Acc
longLabel UniProt (SwissProt/TrEMBL) Protein Accession
priority 2.15
visibility off
type lookup kgXref kgID spID
search exact
itemUrl http://us.expasy.org/cgi-bin/niceprot.pl?%s

name refSeq
shortLabel RefSeq
longLabel NCBI RefSeq Gene Accession
priority 2.2
visibility off
type lookup knownToRefSeq name value
search exact
itemUrl http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Nucleotide&term=%s&doptcmdl=GenBank&tool=genome.ucsc.edu
The format of a columnDb.ra file is simple: one field per line, and records separated by blank lines. Each line begins with the field name, and the remainder of the line is the field value.
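Because the format is so regular, splitting a line into its two parts takes only a few lines of C. The routine below is a minimal sketch rather than the parser the program actually uses; the name raSplitLine and its return convention are assumptions made for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Split one .ra line in place into a field name and a field value.
 * Returns 1 for a field line and 0 for a blank line, which marks
 * the end of a record. */
int raSplitLine(char *line, char **retName, char **retValue)
{
    char *s = line, *end;
    while (isspace((unsigned char)*s))                 /* skip leading white space */
        ++s;
    if (*s == '\0')
        return 0;                                      /* blank line */
    *retName = s;
    while (*s != '\0' && !isspace((unsigned char)*s))  /* scan over the name */
        ++s;
    if (*s != '\0')
        *s++ = '\0';                                   /* terminate the name */
    while (isspace((unsigned char)*s))                 /* skip gap before the value */
        ++s;
    *retValue = s;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1])) /* trim newline and spaces */
        --end;
    *end = '\0';
    return 1;
}
```

A caller would read the file one line at a time, add each name and value pair to the current record while raSplitLine returns 1, and close the record when it returns 0.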
This simple, line-oriented format is used for a lot of the metadata at http://genome.ucsc.edu. At one point, we considered using indexed versions of these files as an alternative to a relational database (.ra stands for relational alternative). But there are a tremendous number of good tools associated with relational databases, so we decided to keep the bulk of our data relational. The .ra files are very easy to read, edit, and parse, though, so they see continued use in applications such as these.
The columnDb.ra files are arranged in a three-level directory hierarchy. At the root lies information about columns that appear for all organisms. The mid-level contains information that is organism-specific. As our understanding of a particular organism's genome progresses, we'll have different assemblies of its DNA sequence. The lowest level contains information that is assembly-specific.
The code that reads a columnDb constructs a hash of hashes, where the outer hash is keyed by the column name and the inner hashes are keyed by the field name. Information at the lower levels can contain entirely new records, or add or override particular fields of records first defined at a higher level.
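The add-or-override behavior can be sketched as below. To keep the example self-contained, linked lists stand in for the outer and inner hashes, and all of the names (record, field, recordSetField, and so on) are hypothetical rather than taken from the program's source.

```c
#include <stdlib.h>
#include <string.h>

/* One field of a record. A linked list stands in for the inner hash. */
struct field {
    struct field *next;
    char *name;
    char *value;
};

/* One columnDb record. A linked list stands in for the outer hash. */
struct record {
    struct record *next;
    char *name;              /* column name, for example "refSeq" */
    struct field *fields;
};

/* Allocate zeroed memory, aborting if none is available. */
static void *needMem(size_t size)
{
    void *p = calloc(1, size);
    if (p == NULL)
        abort();
    return p;
}

static char *cloneString(const char *s)
{
    return strcpy(needMem(strlen(s) + 1), s);
}

/* Find a record by column name, or return NULL. */
struct record *recordFind(struct record *list, const char *name)
{
    for (; list != NULL; list = list->next)
        if (strcmp(list->name, name) == 0)
            return list;
    return NULL;
}

/* Look up a field value in a record, or return NULL. */
const char *fieldValue(struct record *rec, const char *fieldName)
{
    struct field *f;
    for (f = rec->fields; f != NULL; f = f->next)
        if (strcmp(f->name, fieldName) == 0)
            return f->value;
    return NULL;
}

/* Set one field of one column, creating the record or the field if
 * it does not exist yet and replacing the value if it does. */
void recordSetField(struct record **pList, const char *column,
                    const char *fieldName, const char *value)
{
    struct record *rec = recordFind(*pList, column);
    struct field *f;
    if (rec == NULL) {
        rec = needMem(sizeof(*rec));
        rec->name = cloneString(column);
        rec->next = *pList;
        *pList = rec;
    }
    for (f = rec->fields; f != NULL; f = f->next)
        if (strcmp(f->name, fieldName) == 0)
            break;
    if (f == NULL) {
        f = needMem(sizeof(*f));
        f->name = cloneString(fieldName);
        f->next = rec->fields;
        rec->fields = f;
    } else {
        free(f->value);      /* a later level overrides an earlier one */
    }
    f->value = cloneString(value);
}
```

Feeding every field of the root file through recordSetField, then the organism file, then the assembly file produces the layering described above: a name not seen before creates a new record, a new field is added to an existing record, and a repeated field replaces the earlier value.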
Some types of columns correspond very directly to columns in the