If you haven't noticed, I have not touched the site in a while. That has recently changed: there are now 998 organisms in the database, which sat stagnant at 803 for the longest time. The reason was that I was working on a paper that required the database to be frozen. I was also very busy in my new position at the University of Montana. I have since adopted the tactic of creating an entire copy of the database when doing a research project. This freezes the copy for the purposes of the research while allowing the "live" database to continue being updated.

In the interest of full disclosure, the process of updating the database is underway even as you read this. The new organisms have been added with all of the traditional measures in place. Next I will run my algorithms on them. This takes a while, since one involves multiple SCCI runs and the other requires 400 generations of a GA run. Beyond the new organisms, there are 804 genomes that have been updated since I last ran the algorithms on them. I will begin rerunning those once mSCCI and the GA are completed on the new organisms.

Another point of interest for those of you following the evolution of this website: I have a master's student working on a front end to the database. His name is Kevin Scott, and I anticipate work beginning in the next few weeks. The work will be performed on a test system, and when the new front end is ready to be rolled out (after testing, etc.), it will be installed on this site.

What you can expect: an interface will be designed for performing queries on the database. As an example, you will be able to generate a listing of organisms that consists only of those dominant for translational efficiency bias and unconfounded by GC-content. Additionally, Kevin will build a graphical navigation system that will allow you to visually explore a given organism's chromosome to see where the highly biased regions are. This should include features such as zoom and search, but we are in the early stages of the design process. If you have any suggestions or requests, feel free to send them to me at my email account at the University of Montana.
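
As a rough illustration of the kind of query described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the table name `organisms`, the columns `dominant_bias` and `gc_confounded`, the value `translational_efficiency`, and the placeholder rows are all hypothetical, and an in-memory SQLite database stands in for the real one so that the sketch runs without a server. The actual interface has not been designed yet and may look nothing like this.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database with a HYPOTHETICAL schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE organisms ("
        "name TEXT, dominant_bias TEXT, gc_confounded INTEGER)"
    )
    conn.executemany(
        "INSERT INTO organisms VALUES (?, ?, ?)",
        [
            ("placeholder A", "translational_efficiency", 0),
            ("placeholder B", "gc_content", 0),
            ("placeholder C", "translational_efficiency", 1),
        ],
    )
    return conn


def unconfounded_te_organisms(conn):
    """List organisms dominant for translational efficiency bias and not GC-confounded."""
    rows = conn.execute(
        "SELECT name FROM organisms "
        "WHERE dominant_bias = ? AND gc_confounded = 0 "
        "ORDER BY name",
        ("translational_efficiency",),
    )
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(unconfounded_te_organisms(demo_connection()))
```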

I now have code that is able to scrape the phylogenetic relationships from NCBI's taxonomy browser. I would like to include this information in the database. Currently, I include class and phylum only, but this is retrieved from fields in the annotated sequence files, and does not always agree with the more formal listing on the organism's phylogeny page.

I want to add a strand criterion to my measures. This will require knowing where the chromosomal origin of replication is. Once this is known, the genes that reside on the leading and lagging strands can be identified, and differences in their codon composition can be assessed.
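
To make the leading/lagging idea concrete, here is a minimal Python sketch under several stated assumptions: a single circular chromosome, bidirectional replication from a known origin, a terminus taken to lie directly opposite the origin when none is given, and genes that do not span the origin. A gene is called "leading" when it is transcribed in the same direction the replication fork travels through it. The function and parameter names are hypothetical and are not taken from the site's code.

```python
def classify_strand(start, end, strand, ori, genome_length, ter=None):
    """Return 'leading' or 'lagging' for one gene on a circular chromosome.

    start, end : gene coordinates (assumed not to wrap around the origin)
    strand     : '+' or '-' as annotated in the sequence file
    ori        : position of the replication origin
    ter        : terminus position; ASSUMED opposite the origin if omitted
    """
    if ter is None:
        ter = (ori + genome_length // 2) % genome_length
    midpoint = (start + end) // 2
    # Distances measured from the origin in the direction of increasing coordinates.
    gene_offset = (midpoint - ori) % genome_length
    ter_offset = (ter - ori) % genome_length
    # Between ori and ter the fork moves toward increasing coordinates;
    # on the other replichore it moves toward decreasing coordinates.
    fork_moves_forward = gene_offset < ter_offset
    gene_is_forward = strand == "+"
    return "leading" if fork_moves_forward == gene_is_forward else "lagging"


def split_by_strand(genes, ori, genome_length, ter=None):
    """Group (start, end, strand) tuples into leading and lagging lists."""
    groups = {"leading": [], "lagging": []}
    for start, end, strand in genes:
        label = classify_strand(start, end, strand, ori, genome_length, ter)
        groups[label].append((start, end, strand))
    return groups
```

Codon counts could then be tallied separately for the two groups and compared.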

At some point, I want to add a visualization of the fitness landscapes for all of the organisms. The fitness landscape visualization technique is described in my mSCCI paper (Raiford et al.). It currently shows how self-consistent reference sets are in the codon usage space. I may try to find a way to alter it to show the fitness of weight-solutions (as in the search-based approach: Raiford et al.). This would show only translational efficiency ridges, so I may not pursue that particular aspect of fitness landscapes. The current method shows how self-consistent the fitness landscapes are, so if the dominant bias is strand or content, it will present as a ridge as long as it manifests itself in self-consistent reference sets.

Codon Usage Bias Database (CUB-DB)

The Codon Usage Bias Database is a showcase for my bias measures: the modified Self-Consistent Codon Index (mSCCI) and a direct search approach for codon adaptiveness (weights) using a genetic algorithm (GA). The first modifies the Carbone et al. SCCI algorithm to direct it toward translational efficiency bias instead of the dominant bias (mSCCI, Raiford et al.). The second searches for a set of weights that explains the expected high placement of the reference set genes in a list of genes sorted by bias adherence score (Direct Search, Raiford et al.). The latter is the bias measure that I recommend you use if you wish to measure adherence to translational efficiency bias.

In addition to mSCCI and the search-based approach, the database contains various other bias measures (SCCI, CAI, FOP, MCU, Nc, Scaled χ², and tAI) for all of the genomes sequenced and stored on the NCBI Microbial Genomes Website. This database is synchronized with the NCBI database on a regular basis (roughly weekly) so that any sequenced microbial organism should be found on this site within a short time of its being published.
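
For readers new to these measures, CAI is probably the easiest to illustrate. The following is a minimal Python sketch of CAI as it is commonly defined (Sharp and Li): each codon's weight is its count in a reference set divided by the count of the most used synonymous codon, and a gene's score is the geometric mean of the weights of its codons. It is not the Perl implementation behind this site. The function names are hypothetical, the pseudocount of 0.5 for codons absent from the reference set is one common convention assumed here, and methionine, tryptophan, and stop codons are skipped because they carry no synonymous choice.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
EXCLUDED = {"M", "W", "*"}  # single-codon amino acids and stops


def codons(seq):
    """Split a coding sequence into complete codons, ignoring any trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_genes, pseudocount=0.5):
    """Compute a weight per codon from the reference set (ASSUMED pseudocount for zeros)."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa not in EXCLUDED:
            families.setdefault(aa, []).append(codon)
    weights = {}
    for family in families.values():
        adjusted = {c: counts[c] or pseudocount for c in family}
        top = max(adjusted.values())
        for c, n in adjusted.items():
            weights[c] = n / top
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of the gene's scored codons."""
    logs = [math.log(weights[c]) for c in codons(gene) if c in weights]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

In practice the reference set would be a collection of genes believed to be highly expressed, such as ribosomal protein genes.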

The database currently has a size of approximately 41 GB and consists mostly of text files residing on a filesystem. Statistics calculated on a genome-wide basis are stored in a MySQL database, while other data, such as gene-centered statistics, are stored on the filesystem.

The measures stored here are those with which I am most familiar. The implementations of the algorithms are my own (written in Perl and, in the case of the GA, in C++) and, as such, are based upon my interpretation of their published methods. As the website grows, I will include such things as mechanisms for user feedback so that errors in the data or suggestions for improving the site can be communicated to me.

The site will evolve over time, and additional methods and/or data will be included. Additionally, more and more context-sensitive help will be added to make the site more user-friendly.

I hope you find the site useful for your research.
Your friendly neighborhood codon usage guy,
Doug Raiford


The content and opinions expressed on this Web page do not necessarily reflect the views of, nor are they endorsed by, Wright State University.