RAxML-Light v. 1.0.5

How does RAxML-Light work in the CIPRES Science Gateway?

Our RAxML-Light interface lets users run RAxML-Light to infer very large trees. It is implemented as a script that combines RAxML 7.2.8, Parsimonator, and RAxML-Light, so you can run RAxML-Light on large compute resources without writing the Perl scripts that RAxML-Light alone would require. RAxML-Light is brought to you by the iPlant Collaborative. To use it, visit the iPlant Discovery Environment and register.

Why would I use RAxML-Light instead of regular RAxML?

RAxML-Light is designed to decrease the memory footprint of regular tree searches. This makes it possible to analyze very large data sets without exceeding the available memory. Data sets appropriate for RAxML-Light have more than 10,000 taxa and 10-20 genes, or more than about 200 taxa and roughly 1,000 genes. For smaller data sets, you should probably stick with regular RAxML.

The largest trees computed using RAxML-Light alone include a tree with almost 120,000 taxa and 2 genes, which ran nicely under the CAT model on a single 48-core node with 128 GB of memory. Data sets with 1,481 taxa and 20,000,000 sites have also been analyzed under the CAT model, using 672 cores and almost 1 TB of RAM.

The implementation available through the CIPRES Science Gateway runs on a single 32-core node with 64 GB of memory. If you feel your data set requires more resources, please let us know.
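A rough way to judge whether a DNA data set is likely to fit on the gateway node is to estimate the memory needed for the likelihood vectors. The sketch below assumes the commonly quoted approximation of (taxa - 2) x patterns x 8 bytes, multiplied by 4 under CAT and by 16 under GAMMA. The function names are hypothetical, the number of alignment sites is used as an upper bound on the number of patterns, and real memory use will be somewhat higher because other data structures are not counted.

```shell
# Hypothetical helpers; the formula is an approximation, not RAxML-Light's own accounting.
# estimate_mem_bytes TAXA PATTERNS MODEL   (MODEL is CAT or GAMMA, DNA data only)
estimate_mem_bytes() {
    case "$3" in
        CAT)   bytes_per_pattern=$((4 * 8)) ;;    # 4 states x 1 rate x 8-byte doubles
        GAMMA) bytes_per_pattern=$((16 * 8)) ;;   # 4 states x 4 rates x 8-byte doubles
        *) echo "unknown model: $3" >&2; return 1 ;;
    esac
    # One likelihood vector per inner node; an unrooted tree has TAXA - 2 inner nodes.
    echo $(( ($1 - 2) * $2 * bytes_per_pattern ))
}

# fits_on_node BYTES NODE_GB   (succeeds if BYTES is within NODE_GB gibibytes)
fits_on_node() {
    [ "$1" -le $(( $2 * 1024 * 1024 * 1024 )) ]
}
```

Under this approximation, 1,481 taxa and 20,000,000 sites under CAT come to roughly 947 GB, in line with the "almost 1 TB" figure quoted above, while `fits_on_node "$(estimate_mem_bytes 10000 15000 GAMMA)" 64` succeeds for a data set of 10,000 taxa and 15,000 patterns.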

RAxML-Light implements only the CAT and GAMMA models of rate heterogeneity, for DNA and protein data. Currently we support only DNA data, but we expect to support protein data in the near future.

What features does RAxML-Light offer that allow reconstruction of huge trees?  

  • It offers a fine-grained parallelization of the likelihood function with Pthreads for shared-memory architectures and MPI (Message Passing Interface) for distributed-memory architectures with low-latency interconnects (such as InfiniBand, Myrinet, or the dedicated interconnects on the IBM BlueGene systems).
  • A special memory-saving option, -S, makes it possible to save memory and computations on gappy multi-gene alignments (although the program does not necessarily run faster with this option). For example, on a large and very gappy (90% gaps) multi-gene alignment (about 120,000 taxa and 10 genes), the -S option reduced memory consumption from 70 GB to only 19 GB. Note that results (log likelihood scores) may be slightly different, because the models had to be modified to make this feature work.
  • A new protein model function called AUTO can be selected to automatically choose the best protein substitution matrix (with respect to the likelihood) during the tree search. (Protein data will be supported shortly.)
  • The search convergence criterion (-D option) from standard RAxML v. 7.2.8 has been re-introduced for tree searches on extremely large trees.
  • The GAMMA model of rate heterogeneity is now implemented, with the -S memory-saving option.
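The options described above map onto command-line flags. As a minimal sketch of how an interface might turn user choices into those flags, the hypothetical helper below assembles an option string. The function name and its yes/no arguments are illustrative, and the model names GTRCAT and GTRGAMMA are assumed to be the DNA model strings passed to -m; the actual gateway script may build its command line differently.

```shell
# Hypothetical helper: assemble RAxML-Light flags from user choices.
# build_light_options MODEL SAVE_MEMORY CONVERGENCE THREADS
#   MODEL        CAT or GAMMA (assumed to map to -m GTRCAT / -m GTRGAMMA)
#   SAVE_MEMORY  "yes" adds -S (memory saving on gappy alignments)
#   CONVERGENCE  "yes" adds -D (search convergence criterion)
#   THREADS      number of threads passed to -T
build_light_options() {
    case "$1" in
        CAT)   opts="-m GTRCAT" ;;
        GAMMA) opts="-m GTRGAMMA" ;;
        *) echo "unsupported model: $1" >&2; return 1 ;;
    esac
    if [ "$2" = "yes" ]; then opts="$opts -S"; fi
    if [ "$3" = "yes" ]; then opts="$opts -D"; fi
    echo "$opts -T $4"
}
```

For example, `build_light_options GAMMA yes yes 8` prints `-m GTRGAMMA -S -D -T 8`.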

Please Note:

  • We do not currently support analysis of protein data, but expect to add this feature shortly.
  • We do not support checkpointing and restart capability yet, but we plan to support this feature in the near future.
  • The option to parse and write as well as read input alignments as binary data files is not yet supported on parallel versions of RAxML-Light.

What can RAxML-Light compute?

RAxML-Light infers very large trees under Maximum Likelihood. Unlike standard RAxML, RAxML-Light must be given a comprehensive (containing all taxa) bifurcating starting tree. The script used in the CIPRES Gateway obtains the starting tree from standard RAxML or Parsimonator. Figure 1 below shows how these components work together. RAxML-Light program options are explained in the information sections of the interface and in the manual. Many of these options are similar to the standard RAxML options.

 

Figure 1. Workflow for the RAxML-Light interface.

How it Works

To compute an ML tree on a data set such as dna.phy, you need only upload the data set to the CIPRES Gateway and configure the run. If you request bootstraps, the script first generates a set of replica alignment files using the -f j option of standard RAxML:

$RAXMLSERIAL -s $sequence_file -m $substitution_model -n BS -f j -b $bseed -N $bsearches

This creates replica alignments called infile.BSn, where n is the bootstrap replicate number. The number of bootstraps is user-specified; if it is 0, this step is omitted.

Next, parsimony starting trees are created for the input file and for the replica alignments using Parsimonator:

$PARSIMONATOR -s ${sequence_file}.BS\$i -n PB\$i -p \$seed

This creates a parsimony starting tree called RAxML_parsimonyTree.PBn for each replica alignment, and a parsimony tree called RAxML_parsimonyTree.PRn for the input file. (Regular searches on the input file can instead start from the best likelihood tree found across the bootstrap searches.)
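The per-replicate step above can be sketched as a loop over the replica alignments. The dry-run helper below only prints the commands it would run. It assumes that replicates are numbered from 0 to N-1, and it falls back to a placeholder binary name, parsimonator, when $PARSIMONATOR is not set; the function name and the seed value are hypothetical.

```shell
# Dry-run sketch: print one Parsimonator command per bootstrap replicate.
# print_parsimonator_cmds ALIGNMENT N_REPLICATES SEED
# Assumptions: replicates are numbered 0..N-1; "parsimonator" is a placeholder
# binary name used only when $PARSIMONATOR is unset.
print_parsimonator_cmds() {
    i=0
    while [ "$i" -lt "$2" ]; do
        echo "${PARSIMONATOR:-parsimonator} -s $1.BS$i -n PB$i -p $3"
        i=$((i + 1))
    done
}
```

Calling `print_parsimonator_cmds infile 2 12345` prints one command for infile.BS0 and one for infile.BS1; with 0 replicates it prints nothing, matching the case where the bootstrap step is omitted.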

Next, RAxML-Light does rapid bootstrap searches on the replica alignments, and regular tree searches on the input data set.

For bootstrap searches:

$RAXMLLIGHT -s ${sequence_file}.BS\$i -m $substitution_model -n LB\$i -D $save_memory -t RAxML_parsimonyTree.PB\${i}.0 -T $searchcores &

This uses the replica alignment .BSn and the starting tree RAxML_parsimonyTree.PBn to infer a likelihood tree for each replica data set. The tree is written to RAxML_Tree.LBn, where n is the bootstrap number. These likelihood trees are used to measure convergence and provide support values. The best likelihood tree can also be used as a starting tree for regular searches.
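The trailing & in the bootstrap command starts each search in the background, so several searches can share one node. How the gateway script limits the number of simultaneous searches is not described here; one simple possibility, shown as a sketch under that assumption, is to start the searches in fixed-size batches and wait for each batch to finish. The function name run_in_batches is hypothetical.

```shell
# Sketch: run independent commands in the background, at most MAX at a time.
# run_in_batches MAX CMD...
# Each CMD is a complete command line passed as a single string.
run_in_batches() {
    max=$1
    shift
    running=0
    for cmd in "$@"; do
        sh -c "$cmd" &              # start this search in the background
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait                    # block until the whole batch has finished
            running=0
        fi
    done
    wait                            # wait for any remaining background jobs
}
```

On a 32-core node with each search using 8 threads, for instance, a batch size of 4 would keep the node fully occupied without oversubscribing it.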

For regular searches:

$RAXMLLIGHT -s $sequence_file -m $substitution_model -t \$start_tree -n LR\$i -T $searchcores $save_memory

This uses the input alignment and either the parsimony tree or the best bootstrap likelihood tree as the starting tree to infer a likelihood tree for the input data set. The command is repeated n times, where n is the number of regular searches specified by the user. The inferred tree is written to a file .LRn, where n is the regular search replicate number.

If you use RAxML-Light, please cite:

RAxML-Light version 1.0.5 by Alexandros Stamatakis

Alexandros Stamatakis's affiliation is:
Scientific Computing Group
Heidelberg Institute for Theoretical Studies

 
   

 

If there is a tool or a feature you need, please let us know.