Database Research (Val Tannen, Focus Leader)
The CIPRES project will assume responsibility for Treebase, a community-based database where phylogenetic results generated by individual investigators (published or unpublished) can be gathered with sufficient metadata and curation to make this information useful to the wider community. Research and development will be led by Val Tannen, Bill Piel, Brent Mishler, and Michael Donoghue, and production will be led by Jin Ruan at SDSC. A production Treebase mirror has now been created in MySQL with a PHP front end.
For the first two years of the CIPRES project, the focus has been principally on improving the quality of database capabilities available to systematists. Part of this effort is creating a schema that supports rich metadata storage and facilitates much more complex queries than are possible in the existing Treebase. This new schema (called Treebase2) should be informed by, and compatible with, as many of the other current phylogenetic and morphological databases as is practical. It is our goal to interact with other database groups and to iterate on the Treebase2 schema as data standards and community requirements change. The current Treebase2 schema is available here. It is a draft at this time and will be put into production in early 2006.
In the remaining years, the CIPRES database team will investigate new concepts for improving and expanding tools for data treatment in systematics. The research efforts include investigating new strategies for representing and storing trees, developing ontologies that allow morphological data to be integrated readily with sequence data, developing tools to integrate database functionality into CIPRES software, and working with other large NSF projects, including SEEK and NESCENT, to develop database schemas that are maximally useful across the communities.
In addition to providing the production resource for archiving, curating, and disseminating results of phylogenetic research, the database team is investigating new and better ways to store and present phylogenetic data.
Two other kinds of databases are required:
- a transient database to store the results of large calculations until the total job is completed. Yifeng Zheng, a DB2 programmer from the University of Pennsylvania, has created a database that meets the needs of modelers who create large artificial datasets for querying. The SDSC group is working with Yifeng to ensure this effort is translated into a production environment at SDSC.
- a persistent database to store original experimental results. Raw experimental data, whether generated in the field or through computational methods, are always a superset of the data to be stored in a public resource. Nevertheless, these data are of critical importance to the researchers or research groups that generate them. To capture these data, individual investigators will require a personal datastore tool that is more extensible and more flexible than a production resource like Treebase. At present, the database team is working to develop a schema for this kind of data resource.
For more information, contact Jin Ruan (Treebase issues), Bill Piel (data curation and presentation), or Shirley Cohen (basic database research, notebook efforts).

