Modeling Research (Junhyong Kim, Focus Leader)
Current models of phylogenetic data largely ignore the fact that evolution is not exactly "tree-like." Horizontal transfer of genetic material and hybridization give rise to networks rather than trees. Differences between gene trees and species trees also complicate the reconstruction and restrict sequence lengths for phylogenetic analyses. Even at the genomic level, other evolutionary events must be taken into account, such as gene rearrangement events. Beyond that level, many important factors affect evolution, such as gene regulation, metabolism, natural selection on phenotypes, population ecology, climate, and geology.
This modeling group has two goals:
- to generate benchmark data for the critical assessment of algorithms (a preliminary static simulated dataset, i.e., one not using the random sampling DB scheme, is available here)
- to formulate realistic models of molecular evolution
Real and simulated data will be used for benchmarks: the former is our real target, but the latter enables us to test competing reconstruction strategies for such attributes as accuracy, convergence rate, and robustness that cannot be easily assessed with real data.
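To make the accuracy point concrete: with simulated data the true tree is known, so a reconstruction can be scored directly against it. The sketch below is only an illustration, not the group's benchmark procedure. It assumes rooted trees written as nested Python tuples of leaf names and counts the clades found in one tree but not the other (a rooted, Robinson-Foulds-style distance). The function names `clades` and `clade_distance` are hypothetical.

```python
def clades(tree):
    """Return (leaf_set, clade_set) for a rooted tree given as nested tuples.

    A leaf is a string; an internal node is a tuple of subtrees.
    Each clade is the frozenset of leaf names below one internal node.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found


def clade_distance(true_tree, estimated_tree):
    """Number of clades present in exactly one of the two trees."""
    _, true_clades = clades(true_tree)
    _, est_clades = clades(estimated_tree)
    return len(true_clades ^ est_clades)


# Hypothetical example: a simulated "true" tree and a wrong reconstruction.
true_tree = (("A", "B"), ("C", "D"))
estimate = (("A", "C"), ("B", "D"))
print(clade_distance(true_tree, true_tree))
print(clade_distance(true_tree, estimate))
```

A distance of zero means the reconstruction recovered every clade of the simulated tree. With real data no such reference tree exists, which is why this kind of score needs simulation.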
Model development: A key aspect of our research is the development of biologically realistic and compelling models of sequence evolution. We are using a multi-layered approach for our stochastic model development.
The initial layer is a "key molecule" simulation, in which we build models based on detailed comparative information collected for "key molecules" that are commonly used in phylogenetics, for example small subunit ribosomal RNA and full-length cytochrome b sequences from mammals. We will develop a mixed model consisting of many distinct submodels that can each explain the evolution of a fraction of the sites in the gene.
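The idea of a mixed model, in which each submodel explains a fraction of the sites, can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the project's model. It assumes every submodel is a Jukes-Cantor (JC69) process that differs only in its relative rate, and it evolves a sequence along a single branch. The class weights, rates, and function names are hypothetical.

```python
import math
import random

BASES = "ACGT"


def jc69_change_prob(rate, branch_length):
    """Probability that a site differs from its ancestor under JC69."""
    return 0.75 * (1.0 - math.exp(-4.0 * rate * branch_length / 3.0))


def evolve(ancestor, site_classes, branch_length, rng):
    """Evolve a sequence along one branch under a mixture of site classes.

    site_classes is a list of (weight, relative_rate) pairs. Each site is
    assigned to one class with probability proportional to its weight.
    Returns the descendant sequence and the class index of every site.
    """
    weights = [weight for weight, _ in site_classes]
    assignment = rng.choices(range(len(site_classes)), weights=weights,
                             k=len(ancestor))
    descendant = []
    for base, cls in zip(ancestor, assignment):
        p_change = jc69_change_prob(site_classes[cls][1], branch_length)
        if rng.random() < p_change:
            # Under JC69 the three other bases are equally likely.
            base = rng.choice([b for b in BASES if b != base])
        descendant.append(base)
    return "".join(descendant), assignment


# Hypothetical mixture: 40% invariant sites, 60% fast-evolving sites.
rng = random.Random(1)
ancestor = "".join(rng.choice(BASES) for _ in range(60))
descendant, classes = evolve(ancestor, [(0.4, 0.0), (0.6, 2.0)], 0.3, rng)
print(ancestor)
print(descendant)
```

A realistic mixed model would let the submodels differ in more than rate, for example in base composition or in constraints from secondary structure, but the per-site assignment step would look much the same.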
The second layer will extend this model to include whole-genome processes such as horizontal transfer, co-evolution, and interacting proteins.
The third layer will be to develop biologically relevant fitness functions that capture the important interactions.
The first two layers of simulation only take into account so-called "macro-evolutionary" processes, where a single representative genome stands for an entire population. Numerical simulation of micro-evolutionary processes has long been a research tool in evolutionary biology, genetic algorithms, genetic programming, and artificial life. We are slowly gaining insight from large-scale statistical characterizations and detailed models of regulatory and metabolic networks.
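For contrast with the single-representative-genome view, a micro-evolutionary simulation tracks a whole population. The sketch below uses the classical Wright-Fisher model of neutral genetic drift as a minimal example. It is an illustration chosen here, not a model named by the project, and the function name and parameters are hypothetical.

```python
import random


def wright_fisher(pop_size, init_freq, generations, rng):
    """Simulate neutral drift of one allele in a haploid population.

    Each generation, every one of the pop_size offspring copies the allele
    with probability equal to its current frequency. Returns the frequency
    trajectory, including the starting value.
    """
    count = round(init_freq * pop_size)
    trajectory = [count / pop_size]
    for _ in range(generations):
        freq = count / pop_size
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        trajectory.append(count / pop_size)
    return trajectory


# Hypothetical run: 100 individuals, allele starting at 50%.
rng = random.Random(42)
trajectory = wright_fisher(pop_size=100, init_freq=0.5, generations=200, rng=rng)
print(trajectory[0], trajectory[-1])
```

Once the allele is lost or fixed, its frequency stays at 0 or 1. That population-level outcome is exactly what a macro-evolutionary model summarizes with one representative genome.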
The fourth layer will include the simulation of a simple and well-studied biological system, a virus. Viruses typically contain fewer genes than the genomes of higher organisms, but also feature evolutionary complexities such as gene interaction, gene duplication, and mechanisms for genetic exchange. The combination of their small size and inherent complexity makes viral genomes an interesting testbed for simulation and testing. We will simulate biological data for an RNA virus, vesicular stomatitis virus (VSV), using an approach similar to that described for T7. VSV is a pathogen of livestock; like T7, it is small (11 kb, 5 genes), well described, and widely used in studies of molecular virology and virus evolution.
For more information, contact Junhyong Kim

