Phylip format: If you are having trouble submitting a valid Phylip file, this page has some tips.

If your file is rejected please try the suggestions below. If they are not sufficient, please let us know. The rules are given below; exceptions we have seen that were fatal for user submissions are highlighted in bold, italic font.

Phylip format is unforgiving, but mercifully simple. There are only two kinds of information provided in Phylip format: Line 1 provides the number of Taxa and Characters in your matrix; Line 2 and subsequent lines provide data in the following rigid format: a Taxon identifier (up to 10 characters) followed by characters for that taxon. Apart from getting the number of taxa and characters correct, the following rules govern Phylip format:

Line number 1 provides the number of taxa, then one space (Tab not allowed), then the number of characters in each taxon (every taxon has the same number of characters). An extra carriage return (that is, an empty line between Line 1 and Line 2, or between any other lines) will cause failure.

Line number 2 provides a taxon identifier and data. The taxon identifier can be up to 10 characters. Numbers, underscores, and spaces are all allowed.

If you have taxon identifiers that are identical for 10 characters (normally a violation), the CIPRES portal will append an integer to identical taxon names. For example, if Lebistes_r is the taxon name for three different sequences, these will be converted to Lebistes_r#1; Lebistes_r#2; and Lebistes_r#3 and this will be reflected in the output file.
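
The renaming described above can be sketched in a few lines of Python. This is only an illustration of the behavior, not the portal's actual code: the function name `disambiguate_names` is hypothetical, and leaving unique names unchanged is an assumption.

```python
from collections import Counter


def disambiguate_names(names):
    """Append #1, #2, ... to every taxon name that occurs more than once."""
    totals = Counter(names)   # how often each name occurs overall
    seen = Counter()          # how many copies of each name we have emitted so far
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}#{seen[name]}")
        else:
            # Assumption: names that are already unique are left alone.
            result.append(name)
    return result


names = ["Lebistes_r", "Poecilia_s", "Lebistes_r", "Lebistes_r"]
print(disambiguate_names(names))
```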

The tool expects the first character state in Column 11 for each and every sequence, no ifs, ands, or buts.
Any violation will cause the file to fail.
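
If you want to check a file before submitting it, the rules above can be tested with a short script. The following is a minimal sketch, not the parser used by the portal. It assumes a sequential (non-interleaved) file with one taxon per line, and the function name `check_phylip` and its messages are made up for this example.

```python
def check_phylip(text):
    """Return a list of problems found in a Phylip file (empty list if none)."""
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]

    # Line 1: number of taxa, exactly one space, number of characters.
    header = lines[0]
    parts = header.split(" ")
    if "\t" in header or len(parts) != 2 or not all(p.isdigit() for p in parts):
        return ["line 1: expected '<ntax> <nchar>' separated by one space"]
    ntax, nchar = int(parts[0]), int(parts[1])

    problems = []
    rows = lines[1:]
    if len(rows) != ntax:
        problems.append(f"expected {ntax} taxon lines, found {len(rows)}")
    for number, row in enumerate(rows, start=2):
        if row.strip() == "":
            problems.append(f"line {number}: empty line")
            continue
        # Columns 1-10 hold the identifier; the data must start in column 11.
        # Blanks inside the data are ignored when counting characters.
        data = row[10:].replace(" ", "")
        if len(data) != nchar:
            problems.append(
                f"line {number}: expected {nchar} characters from column 11, "
                f"found {len(data)}"
            )
    return problems


good = "2 4\nLebistes_rACGT\nPoecilia_sAC-T\n"
print(check_phylip(good))                 # no problems reported
print(check_phylip("1 4\nHomo ACGT\n"))   # name not padded to 10 columns
```

In the second call the name "Homo" is not padded out to ten columns, so nothing is left from Column 11 onward and the line is reported.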

What should I do if I have long taxon names, but my file is otherwise in legal Phylip format?

Since this is not valid Phylip, our Phylip parser will fail for this file. However, all is not lost. You can use a simple text editor, and make your file legal Nexus very quickly. Open your file in the text editor. Above all other characters, paste this header

#NEXUS
Begin data;
        Dimensions ntax= nchar=;
        Format datatype= missing= gap=;
        Matrix

And at the bottom of the file, type

;  End;

(there are two semi-colons in the above line)

Next, cut the ntax and nchar values from the first line of your Phylip file, and paste them after ntax= and nchar=, respectively.
After format datatype= you can enter dna, protein, or standard, as appropriate.
Finally, enter the symbol that indicates missing states in your matrix after missing=, and the symbol for gap states after gap=. Usually these are ? and - respectively.

Unlike Phylip, Nexus does not tolerate blank spaces in taxon names. Lebistes reticulatus must be represented as Lebistes_reticulatus.

Now your file will be seen as valid Nexus. If you make a mistake, the Nexus parser will provide fairly verbose error messages.
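
If you have many files, the same hand edit can be scripted. The sketch below is one possible way to do it, under stated assumptions: every taxon sits on one line, the long name contains no blanks (as Nexus requires), and the name is separated from the data by at least one blank. The function name `phylip_to_nexus` and its defaults are hypothetical. The `#NEXUS` first line and the `Matrix` keyword are included because the Nexus format requires them.

```python
def phylip_to_nexus(text, datatype="dna", missing="?", gap="-"):
    """Wrap a Phylip-style matrix with long taxon names in a Nexus data block."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The first line of the Phylip file supplies ntax and nchar.
    ntax, nchar = lines[0].split()
    out = [
        "#NEXUS",
        "Begin data;",
        f"        Dimensions ntax={ntax} nchar={nchar};",
        f"        Format datatype={datatype} missing={missing} gap={gap};",
        "        Matrix",
    ]
    for row in lines[1:]:
        # Assumption: the name is the first whitespace-delimited token.
        name, data = row.split(None, 1)
        out.append(f"{name} {data.replace(' ', '')}")
    out.append(";")
    out.append("End;")
    return "\n".join(out) + "\n"


phylip = "2 4\nLebistes_reticulatus ACGT\nPoecilia_sphenops AC-T\n"
print(phylip_to_nexus(phylip))
```

Names that contain blanks must be changed to use underscores before running a script like this, since the split on the first blank would otherwise cut the name short.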

The tool will also expect to know the characters in the file.

Input for DNA sequence programs: (shamelessly stolen from Dr. Felsenstein's site, thanks Joe!).

The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). Each base is one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was previously allowed but is no longer allowed, because it is sometimes used in different senses in other programs). Blanks and numerical digits are ignored. Characters can be either upper or lower case. The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.

Symbol Meaning
A Adenine
G Guanine
C Cytosine
T Thymine
U Uracil
Y pYrimidine (C or T)
R puRine (A or G)
W "Weak" (A or T)
S "Strong" (C or G)
K "Keto" (T or G)
M "aMino" (C or A)
B not A (C or G or T)
D not C (A or G or T)
H not G (A or C or T)
V not T (A or C or G)
X,N,? unknown (A or C or G or T)
O deletion
- deletion
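
The table above translates directly into a lookup table, which is a convenient way to spot illegal symbols (such as a period) before submitting. This is a rough sketch: the names `IUPAC_DNA` and `invalid_dna_symbols` are invented for the example, and representing a deletion as an empty string is just a convention chosen here.

```python
# Each symbol maps to the bases it can stand for, as listed in the table.
IUPAC_DNA = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "TG", "M": "CA",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "", "-": "",  # deletion: no base present
}


def invalid_dna_symbols(sequence):
    """Return the set of characters in sequence that are not legal DNA symbols."""
    bad = set()
    for ch in sequence:
        if ch == " " or ch.isdigit():
            continue  # blanks and numerical digits are ignored
        if ch.upper() not in IUPAC_DNA:  # upper and lower case are both allowed
            bad.add(ch)
    return bad


print(invalid_dna_symbols("acgt NNRY-?"))  # empty set: every symbol is legal
print(invalid_dna_symbols("ACG.TZ"))       # the period and Z are reported
```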

Input for the Protein Sequence Programs

The first line contains the number of species and the number of amino acid positions (counting any stop codons that you want to include). The sequences can have internal blanks but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion. The protein sequences are given by the one-letter code used by the late Margaret Dayhoff's group in the Atlas of Protein Sequences, and consistent with the IUB standard abbreviations. In the present version it is:

Symbol Stands for
A ala
B asx
C cys
D asp
E glu
F phe
G gly
H his
I ileu
J (not used)
K lys
L leu
M met
N asn
O (not used)
P pro
Q gln
R arg
S ser
T thr
U (not used)
V val
W trp
X unknown amino acid
Y tyr
Z glx
* nonsense (stop)
? unknown amino acid or deletion
- deletion

where "nonsense" and "unknown" mean, respectively, a nonsense (chain termination) codon and an amino acid whose identity has not been determined. The state "asx" means "either asn or asp", the state "glx" means "either gln or glu", and the state "deletion" means that alignment studies indicate a deletion has happened in the ancestry of this position, so that it is no longer present. Note that if two polypeptide chains are being used that are of different length owing to one terminating before the other, they can be coded as (say)

             HIINMA*????
             HIPNMGVWABT

since after the stop codon we do not definitely know that there has been a deletion, and do not know what amino acid would have been there. If DNA studies tell us that there is DNA sequence in that region, then we could use "X" rather than "?". Note that "X" means an unknown amino acid, but definitely an amino acid, while "?" could mean either that or a deletion. Otherwise one will usually want to use "?" after a stop codon, if one does not know what amino acid is there. If the DNA sequence has been observed there, one probably ought to resist putting in the amino acids that this DNA would code for, and one should use "X" instead, because under the assumptions implicit in either the parsimony or the distance methods, changes to any noncoding sequence are much easier than changes in a coding region that change the amino acid.
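
The padding convention described above, and a check against the one-letter code table, can be sketched as follows. The helper names `pad_after_stop` and `invalid_protein_symbols` are hypothetical, and accepting lower-case letters is an assumption made for this example.

```python
# Legal symbols from the table: the one-letter codes (J, O and U are not used)
# plus "*" for a stop codon, "?" for unknown or deletion, and "-" for deletion.
PROTEIN_SYMBOLS = set("ABCDEFGHIKLMNPQRSTVWXYZ*?-")


def invalid_protein_symbols(sequence):
    """Return the set of characters that are not legal protein symbols."""
    # Internal blanks are allowed, so they are skipped rather than reported.
    # Assumption: lower-case letters are accepted and treated as upper case.
    return {ch for ch in sequence if ch != " " and ch.upper() not in PROTEIN_SYMBOLS}


def pad_after_stop(sequence, length, fill="?"):
    """Pad a chain that ends at a stop codon out to the alignment length."""
    if len(sequence) > length:
        raise ValueError("sequence is longer than the alignment")
    return sequence + fill * (length - len(sequence))


print(pad_after_stop("HIINMA*", 11))            # HIINMA*????
print(pad_after_stop("HIINMA*", 11, fill="X"))  # HIINMA*XXXX
print(invalid_protein_symbols("HIPNMGVWABT"))   # empty set
```

Use the default "?" when you do not know whether anything follows the stop codon, and fill="X" when DNA sequence has been observed in that region.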