T. N. T.
Tree Analysis Using New Technology
Version 1.1 © P. Goloboff, J. S. Farris, and K. Nixon

Contents:
Introduction - Getting Started
What TNT doesn't do...
Printing & Metafiles
Continuous characters
Merging Data Files
Basic Character Settings
Scopes, sets, and blocks
Character and block names
Step-Matrices
Ancestral states
Nexus files
Saving suboptimal trees
Memory Management
Tree Collapsing
Batch Menus
Measures of support
Linux versions
Consensus estimation
Search algorithms
Setting search parameters
Implementation notes...
Scripting (advanced users)

Commands, by category:
General
Optimality criteria
Search options
Tree diagnosis
Tree comparisons
Group supports, randomization
Scripting
TNT is a program for phylogenetic analysis under parsimony. It provides fast tree-searching algorithms, as
well as extensive capabilities for tree diagnosis, consensus, and
manipulation. This documentation
explains some important things that are not immediately apparent (or cannot be
done easily) from the menus themselves. For details on how to use commands,
users must refer to the on-line help of TNT (accessed by typing help
<enter> at the command prompt, or in the command dialog opened with File/CommandLine).
This file, tnt.htm, provides a general description of the
program; if the file is copied to your Windows directory, then when selecting
the Help menu option, the Explorer is called to read this file. Throughout this help file, commands that can
be typed at the command line (or included in a file), are marked with italics
and bold; menu options are underlined.
The basic thing you need to do to run TNT is to prepare a data input
file. You can either create this from
the program itself, with the File/New and Data/Edit menu options
(Windows only), or in a text editor.
This file should contain the data matrix (format specified in the next
section, Data Input); it may also contain specifications of character names and settings,
like weights and additivities (these can be specified later, within the
program, but it is advisable to have them in the
file, so you do not forget to change something you need to change; see under Basic Character Settings, Step-Matrices, Ancestral states, and Character and block names). Check the
files zilla.tnt,
example.tnt,
and contin.tnt, three example data files
provided with the program. Note that all
the input files must be text-only files.
Once you have read the data in, you are ready to start doing searches,
etc. Like most parsimony programs (e.g. Hennig86, PAUP*,
or Pee-Wee/NONA), TNT has an internal (memory) buffer for trees, and many of
the commands and options (optimization, consensing,
etc.) are available only when there are trees in memory, as these commands are
performed on pre-existing trees. Some
search algorithms create trees de novo (others work on preexisting trees, modifying them to try to find better
trees; see under Tree-searching algorithms). The trees in the memory buffer can be saved to
disk, so they are not lost when you exit the program; you can subsequently read
the trees into the program. The text
output produced by the program (e.g. results and timings of runs,
tree-diagrams, etc.) is written onto an internal text buffer, which can be
later saved to disk.
If you have a Windows version, make sure you install the font provided
with the program, tred.ttf, in your
system. This font is required for TNT to
draw trees properly. If the tred font is not
installed, TNT will attempt to use the Terminal font (which also draws trees, though not as nicely as tred).
There are many things the program does not do. First, the program is conceived as a tool for
standard parsimony analysis. It does not
do maximum likelihood, it does not use specific models of evolution, and it does not do Bayesian
analyses. Although we anticipate
incorporating some likelihood algorithms, they will be quite rudimentary.
Second,
the program does not do sequence alignment of any kind. Sequence comparison is (computationally)
quite a different problem from standard parsimony analysis. If you need
sequence alignment, get a program that does it properly.
Third,
the program does not incorporate methods which are potentially useful but are
somewhat experimental. The "support
weighting" method of Farris (implemented in Farris' program zac) is not included in TNT. That method (which is based on mapping
character changes on a tree in order to evaluate it) requires operations quite
different from the ones used in TNT; we do not anticipate including that method
in the program.
Fourth,
the program uses exact algorithms for Sankoff characters. These work fine for low-medium numbers of
states, but they are not very efficient for very large numbers of states. Thus, for step matrix characters with many
states, TNT is not very efficient. Note
that this applies only to Sankoff ("step-matrix") characters; additive
and nonadditive characters with many states are treated efficiently.
Fifth, the exact solution algorithms in
TNT ("implicit enumeration") are not especially
efficient; their speed is about the same as that of PAUP* or the 15-year-old,
16-bit Hennig86. Since the program is
conceived mostly as a tool for analysis of large data sets, there seemed to be
no point in spending effort on improving the implicit enumeration algorithms.
Sixth, TNT does not implement any kind
of multiple-cut-swapping (commands mswap, swap,
tswap, and altswap in
Nona/Pee-Wee). Those swappers were less
effective than multiple addition sequences.
Likewise, the swap command of Nona/Pee-Wee (i.e.
reporting fit differences when moving a specified clade around the tree) is not
directly implemented in TNT (it would probably be easy to implement using
the sprit
or tbrit options of the scripting
language). These are probably the only
options present in Nona/Pee-Wee but absent in TNT.
The Windows version of TNT has rudimentary printing
capabilities. You can print the display
buffer as a text file, using File/Output/PrintDisplayBuffer
(optionally, you can save the text in the display buffer to a file, and print
it later with your favorite word processor; choose File/Output/OpenOutputFile, and all newly produced output will be
written to the file; optionally, with File/Output/SaveDisplayBuffer,
you can save previously produced output).
To print tree diagrams, you have to set the pre-view for trees as ON
(under Format/PreviewTrees). If you want to map (=optimize) characters in color, choose Format/MapCharactersInColor
(set the colors you like with Format/ColorsForMapping).
Then, draw the trees and (when in the pre-view screen) press
"P". When in the pre-view screen,
pressing "M" saves the tree diagram to a metafile (which can later be
processed/printed with Corel Draw, or exported to Microsoft Word or
PowerPoint).
TNT reads matrices in Hennig86 format, with some refinements. Normally, the data will be included in a
file. The basic format for data input is:
xread
'optional title, starting and ending
with quotes (ASCII
39)'
nchar ntax
Taxon0 0000000000
Taxon1 1011110000
Taxon2 1111111000
.....
TaxonN 1111111000
;
the semicolon at the end is normally not required but is
useful to detect errors (e.g. is the number of taxa set too high?).
Polymorphisms can be read, by enclosing the state(s) in square
brackets. By default, up to 16 states
are allowed by xread, using symbols 0-9 to
indicate states 0-9, and A-F to indicate states 10-15. This can be changed with the nstates command, which defines the number of states (and
data type) to be subsequently read; in Windows versions, this can be changed
with Format/DataType. If the data are defined as DNA (with Format/DataType under menus, or using the command nstates dna;) the IUPAC nucleotide codes (including polymorphisms) are
recognized and used throughout. The states
are internally stored as A=0, G=1, C=2, T=3, and gap=4. If the data are instead defined as proteins
(with the menus, or using nstates prot;),
the IUPAC codes are recognized and used throughout. For 32-state, non-protein data, the symbols
0-9 and A-V indicate states 0-31.
If the data are defined as smaller numbers of states, they occupy less
memory. This may be necessary for large
data sets.
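For instance (the values here are only illustrative), to declare the number of states or the data type before the matrix is read, a line like one of the following can be placed before the xread statement:

nstates 8 ;
nstates dna ;

The first limits the data to 8 states (saving memory); the second declares DNA data, activating the IUPAC codes.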
The data can also be read as interleaved. For this, all that is required is that each
block of data is preceded by either the & symbol
or the ASCII character number 4 (the latter is what PAUP* uses). The taxa in
each block must have the same exact names (including capitals). The taxa in each block can be arranged in a
different order; the taxon numbers correspond to the first occurrence
of a name. A taxon which is not included
in a given block will have only missing entries for the characters in that
block; the same number of characters must be specified for all the taxa within
a given block. Each row of characters
must end with a carriage return. The end
of the data is indicated with a semicolon; if some characters were never
specified, they are left as missing entries (and a warning is issued). If some taxon names were never defined, a
warning is issued. The interleaved
format allows mixing of different data types, by simply enclosing the
corresponding string (prot, num,
or dna) in square brackets right after the &
symbol.
When the data are being read as interleaved, it is possible for
different blocks of data to have different formats. For this, specify between square brackets,
immediately following the '&' that indicates a new block, the format of the
data (numeric, dna,
proteins, continuous). To specify how to
read a state in subsequent commands that refer to states (cost, change,
or usminmax), use the '/' option of the
nstates command.
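As an illustration (the taxon names and states here are hypothetical), an interleaved file mixing a DNA block and a numeric block might look like this:

xread
'hypothetical interleaved example'
10 3
&[dna]
TaxA ACGTAC
TaxB ACGTAA
TaxC ACCTAC
&[num]
TaxA 0110
TaxB 0100
TaxC 1100
;

Note that both blocks use the same taxon names, and that the number of characters specified (10) counts the characters of all blocks.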
While other programs require that continuous characters be discretized,
TNT can deal with continuous characters as such. This avoids the use of the rather ad hoc
methods that have been proposed to discretize continuous distributions in
phylogenetic analysis (gap-coding, Thiele's method, etc.; see Farris, 1990 for
discussion of some of these methods). If there is significant variability in
one of the terminals, it will probably be best represented by a range of one
(or two) standard deviations around its mean.
For normal distributions, this automatically takes care of
non-significant differences.
Continuous characters in TNT can have states 0-65, with 3 decimals
(thus, when not doing implied weighting, tree scores are reported always with 3
decimals). They are always optimized as
additive, i.e., using Farris optimization.
They cannot be made nonadditive or Sankoff. The
continuous characters must be the first ones in the matrix (i.e. the first
block or blocks, followed by blocks with other data formats). Values must be
separated by whitespace; ranges are indicated by two values separated by a dash (-). Missing entries are indicated only with
question marks (?).
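As a hypothetical illustration, a matrix with two continuous characters (the first block) followed by three discrete characters could be written as:

xread
'hypothetical continuous example'
5 3
&[continuous]
TaxA 0.520 1.230-1.410
TaxB 0.610 1.350
TaxC ? 1.200
&[num]
TaxA 010
TaxB 011
TaxC 110
;

The range for the second character of TaxA could represent, for example, one standard deviation around the mean.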
The treatment for the data when there are continuous characters is
rather transparent. The commands/options
that show reconstructions (recons command, or Optimize/Characters/Reconstructions)
or count number of specific transformations (change command, Optimize/CountSpecificChanges) cannot be applied to continuous
characters. Continuous characters can be
edited with the xread=
command, and with the menu option Data/Edit/Taxon, but not with Data/Edit/Character. Continuous characters can be named with the cnames command, but their
states cannot.
Standardization of continuous characters (i.e. rescaling so that the
values are on a common scale) is left to the user. Under pre-defined weights, standardization
will normally be an issue. Under implied weighting,
only the default formula can be applied (i.e. user-weighting functions are not
allowed for continuous characters).
Since continuous characters tend to have a larger s-m difference
(s: steps, m: minimum
possible), implied weighting should normally penalize them rather heavily.
Thus, standardization of continuous characters is probably not so important
under implied weighting.
The dmerge command merges data
files, and saves them as blocks; the character names and settings are also
merged (in the Windows version, File/MergeDataFiles, using
Ctrl+Left Mouse Button to select multiple
files). The dmerge
command also merges (by using * as argument) the characters with identical
character names (in the Windows version, with Data/MergeHomonymousCharacters). Some basic rules of correspondence between
states and state names must be satisfied.
In the Windows version, character settings can be determined with Data/CharacterSettings. Current settings can be displayed
with Data/ShowStatus. Alternatively, settings
can be included in the data file, with the ccode
command.
The ccode command is similar to
Hennig86's. It uses specifiers
to determine actions on subsequently selected characters:
+ make additive
- make nonadditive
[ activate
] deactivate
/N give weight N (if no number immediately follows the slash, set weight to 1)
( apply step-matrix costs
) make additive or nonadditive again
The ccode command always must end
with a semicolon. The ccode command does not recognize character names, or
character groups (it does recognize the @ option to
select from a particular block). By default, TNT considers all characters nonadditive (this is
the opposite of NONA and Hennig86), active, and
of weight 1.
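For example (the character numbers here are arbitrary), the following line makes characters 0 through 4 additive, gives character 7 a weight of 2, and deactivates character 9:

ccode + 0.4 /2 7 ] 9 ;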
The costs of character-state transformations can be defined with the cost,
smatrix, or cstree commands
(see below); note that the costs for character-state trees are stored as step-matrix
costs, so they are a type of "step-matrix" character. Step-matrix characters for which the transformation
costs fit a character-state tree (i.e. additive relationships between states)
are (for implied weighting purposes) internally recoded as binary.
For many commands, it is necessary to specify series of numbers
(representing sets of trees, taxa, or characters). Numbering for everything starts from 0
(as in Hennig86 and Pee-Wee/NONA). The
characters (if they have been named, see below) and taxa can be accessed using
either their numbers or their names. For
any command that takes ranges, two numbers separated by a period (X.Y) indicate
a range (from X to Y, inclusive). A blank
space before the period indicates that the range starts from the first possible
value, and a blank after indicates that the range ends in the last possible
value. Note that names cannot be
separated by a period to indicate ranges. Except for the ccode
command (where they have a special meaning), the + (or -)
symbols indicate inclusion (or exclusion) from the list. Also, up to 32 groups of trees, characters,
and taxa can be defined. After having
defined a group, enclosing in curly braces ({}) the group number (or name, if
named) is equivalent to specifying each of the members of the group. For taxon specifications, using @X Y refers
to all the terminals included in node Y of tree X; under Windows versions, the
equivalent is accomplished by selecting nodes under tree-viewing mode (Trees/View;
see below). For character specifications,
using @X Y Z refers to the Yth and Zth characters in block X (note that Y and Z can themselves
be ranges; if the data actually contain blocks, the first one is numbered as 1,
and block 0 refers to the entire set of characters). The blocks can be referred to by their names
(if defined, see below; "all" is always used as name for the entire
matrix). Within execution of a given command, the @X for characters remains in
effect until deactivated explicitly with @0 or
@all. If the data are read as interleaved,
the blocks of data are automatically preserved (although they can be changed
with the blocks command).
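A few hypothetical examples of these conventions (character numbers are arbitrary): the first line below makes characters 5 through 12 nonadditive; the second applies weight 3 from the first character up to character 8; the third makes the first six characters of block 1 additive:

ccode - 5.12 ;
ccode /3 .8 ;
ccode + @1 0.5 ;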
Character names can be defined with Data/EditData/CharacterNames, or alternatively, with the cnames command. The syntax for cnames
is as follows:
cnames
[0 First_block ;
[1 Second_block ;
{0 zero state_zero state_one state_two ;
{1 one state_zero state_one state_two ;
{2 two state_zero state_one state_two ;
{+ three state_zero state_one state_two ;
{+ four state_zero state_one state_two ;
;
An additional semicolon is needed to end the cnames
statement. The names must all be defined
together; character name definitions can be changed with cnames +N charname statename(s);
(this changes character/state names, only one at a time). The names preceded by
opening square brackets, [, are names for blocks, and the names preceded by
curly braces, {, are character and state names.
Not all characters (or blocks) must be named in a cnames
statement; name what you need. The +
symbol means "the character (or block) following the last one named"
(if no character or block had been named before, this is the first character or
block). The characters within a block can be named according to their in-block
numbering sequence, by using the @ option:
cnames
[+ First_block ;
[+ Second_block ;
@First_block
{0 zero_of_first .... ;
{1 one_of_first .... ;
{2 two_of_first .... ;
{3 three_of_first .... ;
{4 four_of_first .... ;
@Second_block
{0 zero_of_second ... ;
{1 one_of_second ... ;
{+ two_of_second ... ;
{+ three_of_second ... ;
;
Changing the reading frame with @blockname
resets to 0 the value of the + symbol for character names.
The
step-matrices can be defined from the menus (with Data/CharacterSettings),
or via the cost or smatrix commands. The first defines transformation costs for sets
of characters in particular; the second defines and stores them, for later
application to sets of characters. The cost command has the following
syntax:
cost N = ...costs... ;
where "costs" (see below) are the step-matrix
costs of character(s) N; note that the cost command does not make
characters step-matrix, only defines their costs (the ccode
command is necessary for this, with "(" making the character
step-matrix, and ")" non-step-matrix). The smatrix command
has the following syntax:
smatrix =S
(xxxx) ...costs... ;
where "costs" are saved in internal step-matrix
number S (named xxxx). Once defined, the step-matrix
can be applied to character(s) N, using
smatrix +S
N ;
or
smatrix +xxxx N ;
The syntax for the costs themselves
is similar for both the cost and smatrix
commands:

X>Y Z U/V W

this defines as Z the cost of transformation from (any state in) X
to (any state in) Y, and as W the cost of transformation between (any states
in) U or V. X, Y, U, or V can be several
states, enclosed in square brackets (as in the xread
command); if the data have been defined as DNA, the IUPAC codes for
polymorphisms apply; using ? indicates
all possible states. If Z or W is
replaced by the letter i, the cost is set to 1000
(presumably, "impossible," but this really depends on the other
costs). When the > symbol is replaced
by a slash, /, this indicates symmetry in transformation costs.
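As a hypothetical example, the following defines, for character 5, a symmetric cost of 2 between states 0 and 1, and a cost of 4 from state 1 to state 2 (remember that the ccode command, with "(", is still needed to actually make the character step-matrix):

cost 5 = 0/1 2 1>2 4 ;
ccode ( 5 ;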
The
minimum possible number of steps for step-matrix characters is calculated with approximate algorithms (in PAUP* the
minimum cannot be calculated at all for step-matrix characters, and therefore
implied weighting cannot be applied).
Beware that for very complex transformation costs (e.g. random numbers
between 1 and 30 for each cost), the algorithms used by TNT can sometimes
overestimate the minimum possible length by a few steps. These algorithms are more prone to error
when there are polymorphic taxa. If in
doubt of whether the program is calculating the minimum possible
steps for a Sankoff character accurately, use the usminmax command to determine a
user-defined minimum. If the
user-minimum is set to 1, all Sankoff characters will
be considered "informative."
Then, you either optimize all characters, or you can do a search for
each character, so as to make sure what the real minimum for each character is.
The
step calculations during searches are exact for any type of
transformation. The algorithms used
consider as possible assignments for optimizations all the states that occur
between 0 and the largest state (states beyond the largest one occurring in the matrix are not considered; states between
0 and the largest state are considered even if they are not present in any
terminal taxon). This is more or less
equivalent to the "allstates"
options of PAUP*, with the difference that PAUP* tries (as I understand it) all
the states defined in the step-matrix (instead of the largest state observed in
the terminals). It is possible (although
unlikely) that in the case of complex transformation costs, if some state
larger than the ones present in the terminals is intermediate between others,
it is required to obtain the minimum length. For example, suppose
you have some DNA characters where the only observed states are A, C, and G,
where all transformations cost 10, except to or from T, which costs 6; whether
or not T is included among the possible states determines whether the minimum
cost can be 20 or 18 (perhaps even influencing whether the character is
considered informative or not). Then,
you can tell TNT the maximum state to consider for all Sankoff characters (in the case of DNA data,
the largest state is T = 3 and a gap is a fifth state = 4; in the case of
protein data the largest state is U = 20; and in the case of 32 alphanumeric
states the maximum state is V = 31).
This must be done prior to the
definition of costs. This obviously slows down calculations, so TNT pays
attention to this only in those cases when it does make a difference (i.e. only
when length could change by including the larger state(s)).
Note
that the branch-swapping algorithms only look at trees where the currently
designated outgroup is outside everything else.
If you have asymmetric costs for some characters, and you want the
program to choose the rooting that minimizes the overall cost, you have to add
an all-missing ancestor. This is different from the way PAUP* works: when
step-matrix asymmetric characters are present, PAUP* also tries rerootings of the tree (i.e. it does moves which put the outgroup "inside"). If you do not include an all-missing ancestor
in TNT, or do not force monophyly of all but the outgroup taxon in PAUP*, the
two programs may produce different results.
For characters with symmetric transformation costs, the root states will
always include the states in the outgroup taxon among the possible states. This may not be the case for step-matrix asymmetric
characters; for example, suppose that going from 0 to 1 is very expensive and
going from 1 to 0 is very cheap; if the outgroup has state 0, it may be better
to assign state 1 to the root, and then postulate several independent changes
from 1 to 0 than a single change from 0 to 1.
That may or may not be reasonable, depending on what the outgroup states
represent. If the outgroup represents a
series of successive sister groups with the same state (example: a wingless
ancestor for hexapods, with wing gain being more expensive than wing loss),
assigning to the root a state different from the one in the outgroup taxon
would imply a number of uncounted events.
In those cases, it is possible to consider the states in the outgroup as ancestral; that is, to choose possible
state assignments for the root only from among the states in the outgroup
taxon. This can be done by choosing Data/CharacterSettings,
then clicking on the "ancestral" option, and selecting the characters
you want to set. Alternatively, you can use the ancstates command, and include
this in the data file itself. Note that
the designation of a character as "ancestral" has no effect if the
character is not an asymmetric step-matrix character.
TNT implements the implied weighting method of Goloboff (1993), using
floating-point (=exact) fit calculations.
The fit for
characters is (by default) calculated as f = k / ( e + k )
(where e = extra steps, and k = constant of concavity, changed
with piwe=k). During the searches, the program reports the
score as an increasing function of the homoplasy (or "distortion" to
be minimized), which is 1 – f = e / ( e + k ). If you prefer the score reported as fit, it
can be calculated once the search is finished.
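For example, to turn on implied weighting with a concavity constant of 6 (the value is only illustrative), type:

piwe = 6 ;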
Since floating-point calculations are used, exact ties are very unlikely, and
this criterion may produce overresolved trees if poorly supported groups are not
collapsed; it is therefore very important that you evaluate group
support when doing implied weighting.
The fit for discrete additive characters (including those with
character-state-trees) is calculated by decomposing the character into binary
variables. Note that this may produce
fits different from those of Pee-Wee or PAUP* (which do not do this
conversion); the fits will always be measured as higher in TNT; it has been
suggested (De Laet, 1997) that the recoding produces
a more meaningful evaluation of the relative weights. This affects only data
sets with discrete additive characters. If the same characters are defined as
continuous, the fit is calculated without decomposing into binary variables
(thus producing the same fit values that Pee-Wee or PAUP* produce for the
additive coding); treating the additive characters as continuous obviously
slows calculations.
In the case of extra internal steps defined by the user, the same number
of extra steps will be added for each of the variables into which the character
is decomposed. If you want some of the
variables to have different numbers of steps, you have to recode the character
as binary yourself.
The method is applied to Sankoff characters as well, in which case the
fit for a character with e extra steps and a minimum cost among states
of c is calculated as
k / ( ( e/c ) + k ) (where k is the constant
of concavity). When the costs are
defined as all unity, the formula produces results identical to those of the
standard formula used by Goloboff (1993).
If the user has defined some characters as having extra internal steps,
these are added to e.
It is possible to define a user
weighting function, by specifying the values of weights for different numbers
of extra steps, relative to no extra steps.
For this, the data must be read with implied weighting ON. The only requirement for user defined
weighting functions is that no (relative) weight is negative. Weighting functions where the reliability increases
with the first extra steps, and then decreases, are (from the numerical point
of view) perfectly valid. To define a
weighting function, type (or include in a file to be parsed) "piwe["
followed by as many numbers as extra steps you can have in your data set. The numbers are read as the relative costs of
transforming from 0 to 1, 1 to 2, etc.
If the relative weights defined are less than the maximum possible extra
steps, the undefined transformations are given a weight equal to that of the
last transformation defined. If more
transformations than the maximum possible extra steps are defined, they are
simply ignored. The user defined
weighting functions are applicable only to discrete, non-sankoff
characters. When user-defined weights are in effect, the fit cannot be
calculated (i.e. only the score as an increasing function of the homoplasy is
reported). Note that user-defined weighting functions can be easily used for
clique analysis (i.e. defining the function as piwe[1
0 ;).
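As another hypothetical example, a function where the first extra step has a relative cost of 4 and each subsequent extra step costs progressively less would be defined as:

piwe [ 4 3 2 1 ;

(extra steps beyond the fourth would all receive the weight of the last transformation defined, i.e. 1).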
The tree-view mode is toggled with Trees/View. Under
tree-viewing, it is possible to select nodes by clicking at them (with the left
mouse button). The node selection
toggles between red/green/unselected.
All the terminals included in nodes selected with red will be considered
as "selected," unless they belong to a node selected in green which
is a member of the red selected node (combinations of red/green selections can
be used to easily define paraphyletic groups).
Note that a green selection serves no purpose unless the green node
itself is within a red node.
When
the tree is locked against manual editing (and regardless of nodes selected with the
left button), positioning the cursor at a node and clicking on the right mouse
button produces a list of synapomorphies of the node (using character/state
names, if defined and in effect).
When the tree is unlocked, it can be edited by clicking on the right
mouse button. If a single node (red or
green) is selected, and the right button is clicked while the cursor is on an
unselected node, the selected node (=source) is clipped and moved to be sister
of the unselected (=target) node. If the
target node is the ancestor of source node, the branch leading to the source
node is collapsed. If the right mouse
button is clicked on a red selected node (regardless of whether other nodes are
selected), all the terminals belonging to the node are pruned from the tree
(unless they in turn belong to green selected nodes, in which case they are
left in the tree). If the right mouse button is clicked on a green selected
node, and the tree is incomplete, the user can select a terminal to add to the
tree (as sister to the green selected node).
When
the tree is edited, the changes are irreversible. If you are not sure whether you want to
keep the changes, you should save the tree to a file before editing it. To avoid mistakes, trees are by default
locked against manual editing; they can be unlocked with Settings/Lock
trees.
The program can save, during
branch-swapping, suboptimal trees. The
values of suboptimal are set with Analyze/Suboptimal,
or with the suboptimal command.
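For example, to retain trees up to 2 steps (or fit units) worse than the best tree found during swapping, one could type (the value is only illustrative; the relative fit difference can be set from the same dialog or command):

suboptimal 2 ;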
Two values can be set: the absolute and the relative fit difference to
accept a rearrangement; the relative fit difference (Goloboff & Farris,
2002) is calculated between the tree being swapped and the rearrangement. These two values take effect also when
collapsing trees for consensus calculation with temporary collapsing under SPR
or TBR. The relative fit difference
between the tree being swapped and the rearrangement is calculated exactly (see
Goloboff & Farris, 2002, for details on how this can be done quickly),
except for TBR with asymmetric step-matrix characters, where the relative fit
difference can be either over- or underestimated by the algorithms used.
Note that the new technology algorithms (sectorial searches,
tree-fusing), since they are not designed primarily to save multiple trees, do
not have a well defined behaviour in terms of the
suboptimal trees saved; some trees will be saved, others won't. Thus, if you
want suboptimal trees, do branch-swapping after having concluded a search with
new tech algorithms.
Except for what Windows itself uses, memory management in TNT
uses special memory allocators written by Steve Farris. This memory manager ensures that TNT has no
memory leaks. For example, reading a new
data set is guaranteed to first set the whole system back to its original
state, and then reallocate memory. The
same applies to macros, which use a different memory segment. The only thing that can cause memory
fragmentation when using TNT is repeating some of the following operations many
times: (1) re-setting the maximum amount of “general RAM” the program can use,
(2) re-setting the size of the display buffer, (3) re-setting the total amount
of RAM to be used by the macro language, or (4) re-setting the number of nested
input files. These things use normal
memory allocators. Other than this, TNT
can be used for any period of time, or for any number of operations, without
causing any memory fragmentation.
TNT implements several rules for tree
collapsing (see Coddington and Scharff,
1994, for discussion). For rule 1, which
may produce collapsed trees longer than the original resolution, the
length-calculating commands (for example, Bremer
support or the tree filter when suboptimal trees are to be discarded) calculate
the length before collapsing; note that the final collapsed trees may be
longer. The collapsing for consensuses
(including Bremer supports) is done only
temporarily. Note that rule 4 cannot be
used during consensus calculation (because it may multiply the number of
trees), but the results for the (strict) consensus under rule 1 will by
necessity be identical to those for rule 4.
The implementation of most rules
requires no special comments. The
exception is rule 4, implemented in TNT as follows:
1) take the tree to be collapsed
2) collapse all nodes for which ancestor and descendant state
sets are identical
3) mark all nodes for which ancestor and descendant state sets
share state(s).
4) for each combination of
possible collapsings for the nodes marked in 3, if
the length of the new tree equals that of the original tree, and no node in the
new tree remains unsupported under rule 1, add the new tree to memory.
5) if any of the trees added to memory at point 4 is a
duplicate of a previously collapsed tree, discard it.
The trees produced will thus not be further collapsed by applying
rule 1 to them; when the trees are only condensed, existing duplicates are not
eliminated automatically (they are when the trees are
"filtered"). This
implementation of rule 4 is almost always more time-consuming than the other
rules (except SPR or TBR, naturally); in some cases, much more time-consuming.
For some input trees, this implementation of rule 4 may produce a larger number
of output trees. Coddington
and Scharff (1994) did not prove that all rule 4
trees can be found by collapsing all rule 1 trees, but it is easily proven (De Laet, pers. comm.) that, if the trees are a set of all the
trees distinct under rule 1, just deleting the trees that become longer when
collapsing produces a set identical to the rule 4 set. However, if the trees to be collapsed are a
subset of the rule 1 trees, discarding the trees that become longer may produce
spurious results (and that is why rule 4 is implemented as it is). If you are
certain that your trees are all the distinct rule 1 trees, you can get the set
of rule 4 trees by simply (a) collapsing the trees under rule 1, and then (b)
getting rid of the trees that are longer (filtering without collapsing).
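The combinatorial cost of step 4 above is what makes this rule expensive: with n nodes marked in step 3, every subset of those nodes is a candidate collapsing, so 2^n candidate trees must be generated and checked. A minimal sketch (the function name and tree-free representation are illustrative, not TNT internals):

```python
from itertools import chain, combinations

def collapsing_combinations(marked_nodes):
    """Enumerate every combination of possible collapsings for the nodes
    marked in step 3 (the power set): each subset is one candidate tree
    whose length and rule-1 support must then be checked in step 4."""
    return chain.from_iterable(
        combinations(marked_nodes, k)
        for k in range(len(marked_nodes) + 1))

# With n marked nodes there are 2^n candidates to examine, which is why
# this implementation of rule 4 is usually more time-consuming than the
# other collapsing rules.
```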
The branch-swappers
use only binary trees as input. If some
non-binary trees exist when a search is invoked, they are automatically
resolved before the search proceeds.
This may cause some trees that were distinct to become identical when
resolved. Additionally, collapsing under
rule 1 (i.e. collapse all ambiguously supported branches) may make the trees
longer than the initial binary trees.
When either of these two things happens, a search may deliver N optimal
trees of minimum length, but if a subsequent search is done, only a reduced
number will be used as input for subsequent searches (this happens very often
in PAUP*, which always collapses the trees after a search). To prevent this, TNT by default retains the
trees as binary after the search (note that the trees are anyway compared to
make sure that they are distinct if they
are collapsed under the criterion in effect). Using these trees for consensus, Bremer support calculation, etc., requires that they be
collapsed temporarily. The temporary
collapsing is set with collapse+; (or under Settings/ConsenseOptions).
When trees are
temporarily collapsed, eliminating taxa requires special considerations. The characters of the taxa still influence
the optimizations; only the taxon positions are ignored. For SPR and TBR collapsing, the results
produced are those that would be obtained by (1) saving all the binary trees
that could be found by swapping on the tree to be collapsed (if a better tree
is found, swapping continues nonetheless from the original tree), (2) pruning
all the taxa to be eliminated, and (3) consensing the
pruned trees. For optimization-based
collapsing, evaluation of support for a branch considers whether merging two
branches in a single branch when clipping a taxon produces a supported branch (note this is not the same as first collapsing the
trees under rule 1, and then clipping taxa).
Eliminating taxa when resampling, estimating consensuses, or searching until the consensus becomes stable, is done in the same
way. If you want to remove the taxon
from the analysis (as opposed to removing it
from the consensus), deactivate the taxon (with the taxcode
command, or the Data/Taxa menu option).
Beware that,
although tree-collapsing involves no special theoretical considerations, it is
of practical importance. Misusing the
collapsing possibilities provided by the program may produce absurd results.
With File/CommandLine,
it is possible to enter commands for TNT.
<Tab> switches focus between main window and command line;
<enter> always takes you to the command line. When focus is on the command line, pressing - or
+ moves the screen one page up or down (respectively). The keys / and \ move through the buffer that
remembers the last commands executed. A
list of the commands (with their syntax) is obtained entering "help"
at the command line.
You can also give TNT (essentially) any
commands you want when invoking it from the DOS prompt (in the Windows version,
commands must be preceded by a semicolon; strings before a semicolon are
interpreted as names of files to open).
The cost command takes the redirect output symbol, >, as a specifier; in the command line the > symbol can be
replaced by a closing curly brace, }.
TNT allows instructions to be stored (in memory or in file), for later
execution, under Settings/Batch.
Note that some purely interactive functions (like searching for text in
the text buffer, or selecting tree nodes under tree viewing mode) cannot be stored
as batch instructions. To store
instructions, choose Set menus to batch, and then simply click on the
options you want (you can later edit them).
When the instructions are subsequently executed, they are run in
order. Make sure warnings are turned off
before running a set of instructions (as otherwise the program may halt when it
issues a warning).
TNT
implements two types of character support measures: Bremer supports and
resampling. The Bremer supports are
calculated from the trees in memory: it is up to the user to find suboptimal
trees. The resampling can be of
different types, some of which are preferable to
others. Standard bootstrapping is
influenced by uninformative characters (and by characters irrelevant to
monophyly of a given group).
Bootstrapping (both standard and Poisson) and jackknifing
(except for a resampling probability of 50%) are affected by character weight
and transformation costs (e.g. additive characters). Symmetric resampling is affected by
none. Outputting results as absolute
frequencies may produce alterations in the apparent support for groups with very
low support (and frequencies cannot measure support for such groups). Both GC and frequency slopes
solve this problem (for slopes, supported groups have negative slopes; the
closer to 0, the better supported the group).
The supports can be measured on the groups present in any given tree
(normally, a most parsimonious tree, but using other trees provides a way to
measure how strongly contradicted some groups are). The slopes must be measured by reference to a
pre-existing tree.
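The reason symmetric resampling is unaffected by character weights follows from its definition: every character is upweighted with some probability p and downweighted with the same probability p, so its expected influence is unchanged whatever its prior weight. A hedged sketch (the exact perturbation TNT applies may differ; names and the unit weight change are illustrative):

```python
import random

def symmetric_resample(weights, change_prob=0.33, rng=None):
    """Sketch of symmetric resampling: each character is upweighted with
    probability p and downweighted with the same probability p, so the
    expected weight of every character is unchanged regardless of its
    original weight or transformation costs."""
    rng = rng or random.Random()
    out = []
    for w in weights:
        r = rng.random()
        if r < change_prob:
            out.append(w + 1)          # upweight
        elif r < 2 * change_prob:
            out.append(max(0, w - 1))  # downweight
        else:
            out.append(w)              # leave unchanged
    return out
```

Because up- and down-perturbations are equiprobable, the distortions that affect bootstrapping and jackknifing under differential weights cancel out.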
If
some taxon has its position in the tree very poorly defined, it may strongly
decrease the support for many groups in the tree. If you wish to find out whether the rest of
the tree is well supported, you can eliminate the taxon from the consensus when
resampling. To find out which taxon is
floating around, you may have to first do a more superficial estimation, saving
the trees, and trying to prune different taxa (note that in such a case you have
to take the precaution of not collapsing the trees with SPR or TBR, as
otherwise you may never find the floating taxon; once you have found it, you turn
collapsing back to SPR or TBR, the most effective, and do a more careful
estimation of the resampling frequencies eliminating the evil terminals).
Linux and Mac OS X versions of TNT
have a couple of differences from the standard versions. First, those versions can take a comma ( , ) instead of a semicolon ( ; ). This is because the shells there use
semicolons to separate commands, and giving arguments to TNT from the command
line becomes difficult. The second difference is that those versions can run in the
background. If you want to start TNT in
the background, use bground as first argument,
then give the normal arguments you would give the program (do not activate
reports and do not change the silent status, but you can set
report+ to report every certain number of trees, replications, or seconds, so
that you can monitor progress by inspecting the output file; make sure you do
have output files, otherwise the program would run and exit without doing
anything!), and end with the usual '&'.
TNT will not write to the console, or expect console input (at least if
no errors occur). It will write its results to a file and exit smoothly (exit
code is 0; if the program runs out of commands while in the background it exits
with code 1). Should some error occur during the search, the program writes a
message to the output file, and exits with code 255. Try
./tnt bground p zilla.tnt, echo=, log peep, rep+1, mu10=ho3, le, ne, quit,
&
and do "tail -f peep" to monitor search
progress.
If you install the PVM system in
your machine, you can run TNT in parallel.
If you have difficult data sets, this may be advisable even if you have
a single-processor machine, as the parallelization interface of TNT
allows for a lot of control in defining search routines (e.g. you can re-analyze sectors of the tree using any search routine you
want). Unlike other phylogeny programs
which can search in parallel (e.g. POY; Wheeler et al., 2003), TNT does not
have pre-established routines for working in parallel. Rather, TNT allows the user to parallelize routines,
interacting with the scripting language; although harder to use, this is much
more flexible and can be tailored to specific needs. The command that handles most of the options
for parallelization is ptnt. With the command ptnt
begin TNT can launch jobs in
parallel; each job can consist of one or several tasks. Each task in the job
receives a set of instructions; the instructions are determined by the user,
and can be (essentially) any set of TNT commands (possibly including
instructions to launch other jobs in parallel). After launching a job, the
master TNT process continues operation normally (i.e. asynchronously). The results from the slaves are brought back
to the master with the command ptnt get;
the master does not receive results from the slaves until
the master itself executes a ptnt get command;
slaves that finished running remain dormant (i.e. not using processor time)
until the master requests results from them (or until the master sends the
slaves a new set of instructions). The master may launch several jobs at the
same time; up to 32 jobs can be running at the same time, and the master may
handle exchanges of trees/information between them. When a job is launched, it
automatically receives from its parent the data (and, optionally, starting
trees), as well as its identification number (for t tasks, numbering goes
from 0 to t-1). Since a slave task
knows its own number, this can be used to have some tasks perform differently
(alternatively, the same thing can be accomplished by launching other job(s)).
A TNT process that launched some jobs can monitor (with ptnt wait
or ptnt gwait)
the progress of the jobs, letting the user access to numbers of trees,
replications, scores, etc., in the slave tasks; based on this information, the
parent process can then recall jobs back (ptnt get),
tell the slaves to skip some instructions and move to a specific part of the
instructions (ptnt goto and
ptnt skipto),
or
send new sets of instructions to the slaves (ptnt again). The parent process can get trees from its
slave tasks, at any point (with ptnt spy).
A
very simple example shows how to parallelize resampling, in a job that receives
the name psample ("parallel sampling"):
ptnt begin psample 10            /* run 10 slave tasks */
    +. -myself =                 /* run in all hosts but master node */
    /* ...these are the slave instructions... */
    collapse tbr ;
    resample rep 10 savetrees ;  /* resample */
    tchoose 1. ;                 /* choose individual trees */
    return ;
ptnt wait ( tasksleft[psample] == 0 ) ;  /* wait for all tasks to end */
keep 0 ;
ptnt get psample ;               /* get results from psample */
collapse notemp ;                /* trees are already collapsed by the slaves */
freqdif ;                        /* calculate freqdifs */
If the number of tasks (10 in this
case) is not specified, as many tasks as hosts minus one are used (and no task
is launched on master node); if no host selection is specified, the default
selection is used (i.e. the same one indicated in the example).
TNT implements the consensus estimation method of Goloboff and
Farris (2001). This is based on making a
certain number of independent searches, and checking whether the end results of
all of them share groups. A group which
is present in all (or most) of the searches, even if the search failed to
produce optimal trees, is likely to be present in the set of optimal trees (and
thus the consensus). This can be used to have a quick idea of the approximate
results for your matrix, or to create constraints to speed up searches. The consensus estimation is invoked with the qnelsen command, or under Analyze/ConsensusEstimation.
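The logic behind consensus estimation can be sketched simply: tally the groups recovered by each independent search and keep those shared by all (or most) of them. This is an illustration of the idea only, not TNT's implementation; groups are represented here as frozensets of taxon names:

```python
from collections import Counter

def shared_groups(search_results, cutoff=1.0):
    """Sketch of the logic behind consensus estimation (Goloboff and
    Farris, 2001): a group present in all (or a fraction >= cutoff) of
    several independent searches is taken as likely to be present in the
    consensus of the optimal trees, even if no single search actually
    found optimal trees."""
    counts = Counter(g for groups in search_results for g in set(groups))
    n = len(search_results)
    return {g for g, c in counts.items() if c / n >= cutoff}
```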
The program implements several types of search algorithms; in
Windows versions, all of them are accessed with the Analyze menu
option. For really small data sets (i.e.
below 20-30 taxa), exact solutions can be obtained via implicit enumeration (or
"branch-and-bound"). For larger data sets, heuristic
("trial-and-error") methods are required. The basic algorithms
implemented in the program are wagner
trees (with sequential addition of the taxa, in the best available position, to
form a tree as short as possible), and different types of swappers and
algorithms that make rearrangements on preexisting trees. Branch-swapping can be of type SPR or TBR.
For relatively messy, but not very big
data sets, the best algorithm consists of multiple random addition sequences
plus TBR (RAS+TBR); this is invoked with Analyze/TraditionalSearch,
using wagner trees as
"starting trees," or with the mult
command. Whether or not you are likely to have found the best trees, or
all the tree islands, depends on the ease with which the algorithm converges to
the minimum length found. For this reason, the program reports the number of
replications that reached the best score found.
The program also reports whether some replications had partial overflows
of the tree file, in which case, some most parsimonious trees may not have been
found during the search; further branch-swapping starting from the trees in
memory ("RAM") can be used to find the additional trees.
For more difficult
data sets, the special and combined algorithms will be required. Note that
instead of wasting search effort, it is often preferable
to make one very aggressive search to make sure what the best score for your
data set is, and then produce additional (independent) hits to that score until
the strict consensus has become stable (as discussed in Goloboff, 1999). There
are four basic types of special algorithms implemented in TNT: ratchet,
drifting, sectorial searches, and fusing.
The ratchet in TNT is slightly different from the ratchet as originally
described by Nixon (1999), and as implemented in Nona/Pee-Wee. The ratchet consists of two phases,
perturbation and search, which are sequentially repeated. In either phase, the ratchet in TNT does not
look for multiple trees (in Pee-Wee/Nona it was necessary to set the number of
trees to save in each phase, but the best number is always just one); multiple
trees can be found only as the search phase finishes. The ratchet always uses TBR branch swapping,
and it alternates the search phase (S) with three types of perturbation phase:
original weights (O), upweighting (U), and deleting
(D). Thus, the cycles will be:
O,S,U,S,D,S,O,S,U,S,D,S,O… etc. Any of the O, U, or D phases can be skipped
(from the Ratchet settings dialog, or with the ratchet command). During the perturbation phase, rearrangements
of score equal or better than the tree being swapped are always accepted, until a certain number of rearrangements have been made
(or until a certain percentage of the total swapping
on the tree has been completed). This
seems to provide better results than the original ratchet. The "autoconstrained" option calculates the (strict)
consensus of the previous tree and the tree resulting from the rearrangement
phase, and during the subsequent search phase, only rearrangements that do not
violate monophyly of the shared groups are done. The rationale for this is that improvements
are likely to be found in areas that have changed during the perturbation
phase. Every certain number
of constrained cycles, an unconstrained search phase is done (to ensure global
optimality). If this is used in
combination with user-defined constraints, both are combined.
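The alternation of phases described above can be generated mechanically; the following sketch (function and phase names are illustrative) shows how the O/U/D perturbations rotate, each followed by a search phase, with skipped phases simply dropped from the rotation:

```python
from itertools import cycle, islice

def ratchet_cycle(n_perturbations, skip=()):
    """Sketch of TNT's ratchet phase sequence: each perturbation phase,
    original weights (O), upweighting (U), or deleting (D), is followed
    by a TBR search phase (S); any of O, U, or D can be skipped."""
    perturbations = [p for p in "OUD" if p not in skip]
    phases = []
    for p in islice(cycle(perturbations), n_perturbations):
        phases += [p, "S"]   # perturbation, then search
    return phases
```

For example, with no phases skipped the sequence begins O,S,U,S,D,S,O,S..., as described above.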
Tree-drifting is quite similar to the
ratchet, but the perturbation phase, instead of reweighting the characters,
accepts the moves with a probability that depends on both the relative fit
difference (Goloboff and Farris, 2001; see Goloboff, 1999) and the absolute fit
difference between the tree being swapped and the new tree. It has the advantage that memory management is
easier, and for that reason it is the only perturbation method
that sectorial searches (see below) can use. The perturbation phase also
alternates between acceptance of optimal trees only (O), and suboptimal trees
(U): O,S,U,S,O,S,U… etc. Any of the O or U phases can
be skipped (the O phase, with drift:noequal;
or with the skip optimal-only drifting option in the tree-drift settings
dialog box; the U becomes effectively an O by setting the score difference to a
very low number).
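The shape of the acceptance rule can be sketched as follows. This is illustrative only: the actual acceptance function from Goloboff (1999) is not reproduced here, and the stand-in below merely shows a probability that decreases with both the absolute and the relative fit difference:

```python
import random

def accept_suboptimal(abs_fit_diff, rel_fit_diff, rng=None):
    """Illustrative stand-in for tree-drifting's acceptance rule: equal
    or better trees are always accepted; worse trees are accepted with a
    probability that decreases monotonically with both the absolute and
    the relative fit difference (the exact function TNT uses differs)."""
    if abs_fit_diff <= 0:
        return True                  # equal or better: always accept
    rng = rng or random.Random()
    p = 1.0 / (1.0 + abs_fit_diff + 10.0 * rel_fit_diff)
    return rng.random() < p
```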
Sectorial-searches are of two basic
types: constraint-based and random; a combination of both is the mixed
sectorial searches. Sectorial searches
take a portion of the tree, create a reduced data set, and produce a
mini-analysis of that data set. If a
better solution for that portion of the tree is found, it is replaced back into
the original tree. Every certain number of (user-determined) rearrangements, a
round of global TBR is done (to ensure global optimality). If the sectors are below a certain size, the
mini-analysis consists of three RAS+TBR (random addition sequence wagner trees plus TBR); if the three sequences end up in
trees of the same score, the mini-analysis is interrupted (i.e. the reduced
data is "clean" and not worth further analysis); otherwise, three
additional RAS+TBR are done. If the sectors
are above that size, a certain number of iterations of tree-drifting
is done to the reduced sector. The three
types of sectorial-search differ in how the sectors are selected. Constraint-based searches choose the sectors
by reference to constraints: the nodes that are resolved as polytomies in the
constraint tree, connecting to no less than N branches (where N is some number
defined by the user; default is 10). A reduced data set corresponding to each
of the polytomies in the constraint tree is analyzed, in turn; this is repeated
a certain number of times (default=3), because changing some sector may imply
that other sectors should change as well.
Random sectors are chosen at random, to be no smaller and no bigger than
a certain (user defined) size. The mixed sectorial searches use both
constraint-based and random sectors; they can only be used with multiple
addition sequences. For the mixed
sectorial search, a temporary constraint tree is created by consensing
the tree found at some stage of the present addition sequence (user defined:
this can be the wagner tree,
which is the default, or optionally the tree produced by SPR or TBR), and this
constraint tree is used to do a constraint-based sectorial search. Once this is finished, a random sectorial
search is done (unconstrained, or using only the user-defined constraints if
they are in effect).
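The early-stopping rule for small sectors can be sketched directly from the description above (names are illustrative; `ras_tbr_score` stands in for one RAS+TBR replicate on the reduced data set):

```python
def analyze_small_sector(ras_tbr_score):
    """Sketch of the mini-analysis applied to small sectors: run three
    RAS+TBR replicates; if all three end at the same score, the reduced
    data set is considered "clean" and analysis stops; otherwise three
    additional replicates are run.  Returns (best score, replicates run)."""
    scores = [ras_tbr_score() for _ in range(3)]
    if len(set(scores)) == 1:        # all three equal: sector is clean
        return min(scores), 3
    scores += [ras_tbr_score() for _ in range(3)]
    return min(scores), 6
```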
Tree-fusing mixes
trees, producing better trees. If the trees come from different searches, the
score is often considerably improved in very short time. Tree-fusing produces (so to speak) a
synergistic effect for all those previous searches; this also means that the
rate of score improvement is not necessarily linear (i.e. all the searches
before fusing were not producing a score improvement beyond some bound, but
were nonetheless accumulating "momentum" for the subsequent fusing).
For many data sets, the time to find best scores during searches
may depend dramatically on the aggressiveness
and parameters used during the search.
For example, a very superficial search on a difficult data set
will keep grinding for a very long time before the best score is actually
found; using more aggressive parameters will find the best score sooner. The parameters to change are size and number
of sectors to run under random sectorial searches, number of initial
replications to submit to fusing, and rounds of tree-drifting or ratchet.
A careful
consideration of how the data set behaves to changes in parameters is the best
way to proceed, but if you have no idea of how a data set behaves, or don't
know how to set the parameters for a search, you may trust to the program the
choice of parameters. This is done with
the set initial level option of the New Technology dialog box (or
the level option of the xmult
command). If entered as a
command, the level must be a number between 0 and 10 (0 is very superficial; 10
is extremely exhaustive); if entered in the dialog box for search level, it must
be a number between 0 and 100 (again, 100 is the heaviest search). If using a driven search (i.e. more than a
single hit of the xmult command),
you may have the program check whether best score is being found easily or not
(and then decrease or increase the search level accordingly). This is set with the chklevel
option of the xmult
command. Again, letting the program try
to determine the best parameters during the search is a poor substitute for
finding the best parameters and leaving them fixed during the search, and is
intended for cases where you have no time/experience to determine best
parameters by yourself.
This section describes some details on
how the searches are implemented. It is
intended to give the user clues on how to best use the program, or at least
understand some of the ways in which the program behaves under different
circumstances.
When all the characters have equal
weights, the swapper proceeds the fastest.
If some characters are weighted differently, the search algorithms
proceed somewhat slower (i.e. it takes longer to complete a wagner tree or swap completely through one tree; note
that finding a shortest tree might still be easier if the weighting increases
the structure of the data set). If the
characters are weighted using very different weights (e.g. every character is
assigned a weight from 1 to 50, at random), the search algorithms reach a floor
(depending on the data sets, about 50% slower than equal weights).
Searching under implied weights, since
it uses floating-point calculations, is the slowest, being 2 or 3 times slower
than equal weights (note that in PAUP*, using implied weighting slows down the
searches by orders of magnitude). The
largest speed differences with PAUP* are in running large data sets (between 20
and 50 times, sometimes more than 100 times), and implied weighting (over 500
times, for large data sets).
The numbers of states also affect
running times. The more states the
characters can have, the slower the program runs. The program internally packs
characters in groups of characters with 2 states, 3 to 4 states, 5 to 8 states,
9 to 16 states, and 17 to 32 states. Sankoff characters are always packed
separately. This implies that the time
differences will come when the numbers of characters in each of those
categories changes significantly. For Sankoff-style characters, the time increases rapidly with
the square of the number of states (in some portions of the code, the time
actually increases with the cube). Thus,
the algorithms used react badly to increasing numbers of states; for Sankoff
characters with 16 states or more, TNT is about as slow as PAUP* (or even
slower). [
N.B. as of March 2023,
these portions of the code with dependence O(S^3)
have been substituted by code with dependence O(S^2), thus significantly speeding up tree searches for Sankoff
characters with many states. This section is left “as is” for documentation
purposes… ]
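The state-count categories described above can be summarized as a simple lookup; this sketch (function name is illustrative) returns the upper bound of the packing category a non-Sankoff character falls into:

```python
def packing_group(n_states):
    """Return the upper bound of the internal packing category for a
    (non-Sankoff) character, per the ranges described above: 2 states,
    3-4, 5-8, 9-16, or 17-32 states."""
    for upper in (2, 4, 8, 16, 32):
        if n_states <= upper:
            return upper
    raise ValueError("characters may have at most 32 states")
```

Running times change noticeably only when the numbers of characters falling in each of these categories change significantly.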
The implementation of searches under
implicit enumeration is similar to that in other programs, and there is nothing
special about it. As usual, exact
searches take the longest for the most poorly structured data, and the times
increase dramatically with the numbers of taxa. Since TNT is conceived to work mostly with
large data sets, there seemed to be little point in devoting effort to optimizing
the implicit enumeration algorithms.
The tree swappers can complete swapping
on a tree faster if no additional equally parsimonious
trees are being found. This is because
in such a case length calculations for a rearrangement can be abandoned as soon
as the program discovers that the tree is as long as the best one found so far;
otherwise, additional characters have to be checked to make sure the tree is no
longer than the best one found so far (and, to make things worse, the tree has
to be optimized for collapsing and compared to all the pre-existing trees, to
make sure it is not identical as collapsed).
This is used to speed up searches under multiple random addition
sequences, where a reduced number of trees is retained for each replication
(such that most of the time the search is being done with a full memory
buffer).
Except in the case of constraints, the
addition sequence for wagner
trees is totally randomized, except for the outgroup taxon (which is always the
first taxon added). The insertion
sequence of new taxa to the growing tree is (by default) not totally random:
the possible locations for the new taxon are tried from the root of the tree to
the tips, or from the tips to the root (which option is chosen is determined at
random, for each taxon to add; both are equiprobable). In some extreme cases, this may be
undesirable. If the data are entirely
uninformative (e.g. a matrix with no characters) and no tree collapsing is in
effect, then pectinate trees result from such
insertion sequence. It is possible to randomize the insertion sequence for new
taxa, so that all locations are tried in a random order. This is set with rseed[;
the alternative is rseed];
(default, up/down sequence). For large
data sets, building a wagner tree with a randomized
insertion sequence may actually take less time than with the up/down sequence
(that is probably because the best placement is often found somewhat in the
middle of the tree, not in the tips or root; checking the tips or root first
provides worse bounds to give up length calculations, for a greater portion of
the time). For data with no informative
characters (and with collapsing turned off), wagner trees with a random insertion sequence are
(more or less) equivalent to random trees.
If your data are more or less structured, setting one or the other is
irrelevant.
Note that randomization of the insertion
sequence has no effect when wagner
trees are built under constraints (i.e. the insertion sequence is always
up/down in this case).
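The insertion-sequence behaviour described above can be sketched as follows (an illustration of the logic, not TNT's code; `n_locations` stands for the number of candidate attachment points on the growing tree):

```python
import random

def attachment_order(n_locations, randomized=False, rng=None):
    """Sketch of the order in which possible locations for a new taxon
    are tried when building a wagner tree: by default, root-to-tips or
    tips-to-root, one of the two chosen at random (equiprobably) for
    each taxon added; with rseed[ in effect, the order is instead fully
    randomized."""
    rng = rng or random.Random()
    order = list(range(n_locations))
    if randomized:
        rng.shuffle(order)           # rseed[ : try locations at random
    elif rng.random() < 0.5:
        order.reverse()              # tips-to-root instead of root-to-tips
    return order
```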
The two types of swapper are designed to
work as best as possible under different circumstances.
The time to complete swapping on a tree
(that is, on a tree which does not lead to better trees by swapping) changes
with about the square of the number of taxa for both SPR and TBR (this is true
even when TBR actually “looks” at a number of rearrangements which increases
with the cube of the taxa).
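The neighbourhood sizes behind these scaling claims are known in the literature (the formula below is from Allen and Steel, 2001, not from the TNT documentation):

```python
def spr_neighborhood(n_taxa):
    """Number of distinct trees exactly one SPR move away from a binary
    unrooted tree with n_taxa leaves: 2(n-3)(2n-7) (Allen and Steel,
    2001).  This grows with the square of the number of taxa, while the
    TBR neighbourhood grows with the cube; completing a swap in time
    proportional to the square is thus the interesting property of TBR."""
    return 2 * (n_taxa - 3) * (2 * n_taxa - 7)
```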
SPR is designed to work best at the
initial stages of a search, when better trees are being found frequently; it is
not as fast as it could be for trees which are already optimal or
near-optimal.
TBR is designed instead to work best for
large numbers of taxa, when the trees being swapped are optimal or
near-optimal, that is, when better trees are not being found frequently, and
also to work best on well-structured data (i.e. data where there are numerous relatively
well supported groups). If the data are
very poorly structured, swapping through a tree takes longer. For Källersjö et
al.'s (1999) 2500-taxon matrix, the time to complete TBR swapping
on a near-optimal tree (running on an 800 MHz PIII machine) is about 12 seconds; PAUP* on the same machine takes over 18 minutes to
complete TBR swapping on the same tree (i.e. a 90-times speed difference; PAUP* takes about 6 minutes to complete SPR
on the same tree, so that completing TBR for 2500 taxa in TNT is 30 times
faster than completing SPR with PAUP*).
Neither type of swapper is highly
optimized for low numbers of taxa. For
small data sets, expect no big differences (i.e. no
more than 5-10 times) in running times for TNT and other programs. These take no time anyway, so it seemed best
to preserve an implementation focused on large data sets.
Heuristic
searches.- There are three
types of constraints: positive tree-constraints, positive group constraints,
and negative group constraints. Tree
constraints are defined with the = option of the force command (or, from the menus, as “positive constraints” with
no floating taxa). The group constraints are
defined with all other options of the force
command (or, from the menus, with positive constraints with floating taxa,
or negative constraints). The three
types can be used simultaneously (e.g. “search for the best trees where A+B+C
form a monophyletic group, but A+B do not").
The tree constraints
are implemented in such a way that the moves violating the groups established
are never attempted. This type of
constraint therefore speeds up the searches (if many groups are defined, by a
significant factor). Note that the number
of trees examined to SPR-swap or TBR-swap through a single tree is the same
under constraints (i.e. the moves are counted as if effected and rejected).
The group
constraints (positive and negative) are implemented internally by creating a
matrix of group membership variables (where taxa belonging to a group have
state 1, taxa outside the group have state 0, and floating taxa have a missing
entry). For group constraints, all the
same moves are evaluated for length. If
the tree is within the current length bound, then the group constraints are
evaluated, and if the new tree does not pass, it is rejected (even if
shorter). Thus, using group constraints
does not speed up searches at all. If
the group constraints make the tree much longer (i.e. if they forbid many
groups present in the consensus of MPT’s), they will actually slow down the
search (i.e. many moves will produce better or equally good trees, which have
to be subsequently rejected because they violate constraints).
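The internal membership matrix described above can be sketched as follows (the `(members, floaters)` representation of a group is illustrative, not TNT's data structure):

```python
def group_membership_matrix(taxa, groups):
    """Sketch of the internal representation of group constraints: one
    binary membership variable per constrained group, where taxa
    belonging to the group get state 1, taxa outside it state 0, and
    floating taxa a missing entry ('?')."""
    matrix = {}
    for taxon in taxa:
        row = []
        for members, floaters in groups:
            if taxon in members:
                row.append("1")
            elif taxon in floaters:
                row.append("?")
            else:
                row.append("0")
        matrix[taxon] = "".join(row)
    return matrix
```

Checking whether a candidate tree satisfies the constraints then amounts to optimizing these membership "characters" on it, which is why group constraints add work to the search rather than pruning moves.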
For wagner trees, the program adds each taxon in a
position such that no constraint is violated at any point. For negative constraints, this means that the
first two taxa which are part of a group constrained for non-monophyly must be
placed as non-sister groups. Of course,
a better tree might be found if the non-monophyly of the group is delayed (to
be broken later, for example, by adding another taxon in between them), but
then if non-monophyly has been delayed, it may now be impossible to create
non-monophyly without at the same time violating some positive
constraints. This means that, for
negative constraints that increase tree length considerably, the wagner trees may be quite far from the best length
achievable under the constraints (e.g. it is advisable not to use wagner trees to estimate consensuses or jackknife
frequencies under those circumstances).
Additionally, the random addition sequence may change somewhat when
positive (tree or group) constraints are used in combination with negative
constraints. The first two members of a
group constrained to be non-monophyletic must effectively be non-monophyletic when they are
both added to the tree, but this may be impossible if those two are in turn the
only two members added so far of a group constrained for monophyly. The addition sequence is
then checked when positive and negative groups are enforced, changing it to
prevent this. The addition sequence is
therefore not completely random when there are positive and negative
constraints in effect.
Exact
searches.- Both positive tree constraints and
positive group constraints speed up searches under implicit enumeration (the
more so if they constrain groups which are anyway present in the optimal
trees). Negative tree constraints slow
them down, to the extent that they make the best possible trees longer.
TNT has a scripting language, which allows the user to automate complex
tasks, or "program" TNT to do things it normally doesn't do. The scripting language has flow control,
loops, and internal variables that can be accessed by the user. The scripting language is normally disabled;
it is enabled by entering "macro=;" at the command
prompt. The macro command also
controls the basic settings for the macro language, like total RAM for macros,
number of variables, number of nested loops, and protection level for accessing
variables/arrays (for details, see online help for macro).
The variables and flow control use
numbers (actual values or internal variables) or expressions (compound
expressions must always be enclosed in parentheses to be parsed properly).
Examples of the scripting language can be
found in the files contained in zipdruns.exe (a self-extracting executable). Exactsol calculates trees by implicit enumeration
(using a complete down-pass optimization for each tree, so it is very slow; the addtax routine is used as a function from
within exactsol). Macswap does
SPR swapping on a tree, and prints out a message every time it finds a better
tree (it also uses down-pass optimization to evaluate trees). Sank does Sankoff optimization;
it is more elaborate than the other two (it can be used only on small trees;
make sure the number of variables is set to a large enough number).
User variables.- User variables can be accessed in any context,
by enclosing the number of the variable to be accessed in single quotes. Thus, if variable number 0 has a value of
1200, inputting the string hold '0' is equivalent to inputting hold
1200. Variables can be given
names, in which case their values are accessed by enclosing their name in
single quotes. Using double quotes
instead of single quotes is equivalent, except in
the parallel versions of TNT, where double quotes passed as
instructions to slaves are interpreted on the slave, while single quotes are always
interpreted locally. Using the single or double quotes properly, the user
variables can act as pointers. User
variables can be defined (when declaring their name) as arrays (maximum is
5-dimensional arrays). The indices are
then enclosed in brackets, separated by a comma.
Thus, 'n[j]' is the
j-th value for variable n (note that the arrays are
zero-based), and 'p[x,y,z]'
is the z-th value for row x, column y, of variable p.
Note that 'n'[j] will be parsed as the number corresponding to variable n
followed by the expression [j] (which, according to context, may or may not be
a legal expression). The indexing works
by substituting values: each time a new "dimension" is found, the
cell accessed is the base cell plus the value of the
dimension. The user could set the values by hand so that a "matrix" is
properly organized, but declaring a variable as multi-dimensional when it is
given a name automatically sets up the arrays.
It is also possible to enclose in the
single or double quotes an expression (which must be enclosed in parentheses);
what is inside the parentheses may, in turn, contain references to user
variables. Thus,
set 5 '(%1)' ;
will set variable 5 as equal to the variable
numbered as the first argument.
Similarly, if
variable 0 equals 10, variable 1 equals 20, and variable 30 equals 100, then
the expression
set 5 '('0'+'1')' ;
will assign the value of 100 to variable 5 (i.e.
assigns to variable 5 the value of the variable whose number equals the sum of
the values of variables 0 and 1).
User variables can also be used to
store strings, accessed by preceding the number (or name) of the variable with a
dollar sign, $. Strings can be stored as
literal (in which case, expressions like '0' within the string are replaced by
the value that variable 0 has at the time of expanding the string). Variables
can be expanded only as numbers or as arguments; no string expansion is
allowed within strings.
The values of the user variables can
be changed only with the set command. Set, followed by a variable number
(or name) and an expression, sets the value
of the variable to the value of the expression. The number of the variable to be set can
itself be an expression (but if so, remember, it must always be enclosed in
parentheses). The expression can be any expression as in the C language; logical
expressions evaluate to either 0 (FALSE) or 1 (TRUE). The expression can also be replaced by ++ , -- , += , -= , *= , /= (with the same meaning as in C; note that
only post-decrements and post-increments are allowed). Thus
if ( length [ 0 ] > 'maxlen' )
   set 1 '1' + 1 ;
else
   set 0 '0' + 1 ;
end
will increase variable number 1 by one if the
length of tree 0 is greater than the value stored in the variable named by the
user as maxlen, and otherwise
increase variable 0 by one. The same
is achieved with the more compact expression
set ( length [ 0 ] > 'maxlen' ) ++ ;
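The compound assignment operators work in the same way; as a sketch (assuming, as the description above suggests, that they take an operand as in C):
set 0 10 ;
set 0 += 5 ;
set 0 *= 2 ;
which would leave variable 0 with the value 30 (10 plus 5, then multiplied by 2).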
Expressions.- Expressions can be any combination of
operations. Valid operators are + - *
and /, as well as the bit operators ( & | ^ ). Valid logical comparisons are ==, > , < , >= , <= , ! , and !=. Since expressions are always evaluated from
left to right (there is no operator precedence), be careful to delimit subexpressions
as desired:
if ( 'maxlen' <= 1000 + 500 )
is actually equivalent to
if ( ( 'maxlen' <= 1000 ) + 500 )
so that it will
first evaluate whether maxlen is less than or equal to 1000, and
then add 500 to the result (if maxlen is less than or equal to
1000, the result is 501, otherwise it is 500; either way the expression evaluates as "true" ...
probably not what you wanted). Either
remember the precedence and invert the order of expressions:
if ( 1000 + 500 >= 'maxlen' )
or use the original order and parentheses :
if ( 'maxlen' <= ( 1000 + 500 ) )
Note that
parentheses must always be balanced.
Changing values of variables.- The values of variables are changed with the set
command:
set N (expression) ;
The
"expression" in this context need not be enclosed in parentheses (this
is the only exception to the rule given above).
The expression in this context must always end with a semicolon. The example above will set variable number N
to the value of the expression. You can
set the i-th variable (where
i itself is any expression) from N, as if N was an
array, by using:
set N[i] ( expression) ;
If some private variables
have been defined in previous files (see "Private variables" below),
the number N is set automatically to the number within that file (use names to
access variables from previous files).
To set variables as strings, use:
set N $text ;
text can be any number of characters, with the
proviso that the string starts at variable N, and continues to subsequent
variables; every 4 letters of the string occupy one user variable (which is a
long int), so that the actual number of user
variables needed to store a string is the length of the string divided by
4. The semicolon indicates the end of
the string (which can contain blanks or spaces; a period followed by a comma is
replaced by a semicolon, as in the quote command). If "text" contains expressions to
be replaced (i.e. single quotes or symbols for arguments), it is replaced on
reading. It is possible to define the
string as "literal" by including an equal (=) sign before the number
of variable:
set =N $text ;
A literal string has
expressions expanded when it is itself expanded (so that it can be expanded to
different strings). No string expansion
is allowed within strings (although it may work in some cases, its results are
not predictable).
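As a sketch of the difference between the two kinds of strings (variable numbers are arbitrary):
set 0 5 ;
set 1 $value is '0' ;
set =2 $value is '0' ;
set 0 9 ;
String 1 was expanded on reading, so it remains "value is 5"; string 2 is literal, so when expanded after the last command it produces "value is 9".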
A string can be copied from one
variable onto another easily, with the set command:
set 10 $$0 ;
copies string 0 onto string 10 (make sure string 0
doesn't extend to string 10 before doing this).
Private variables.- When input files are nested, all the variables named in
preceding files are "private"; that is, an internal numbering
starting from the last unnamed variable is used. Accessing variables by their number is then
relative to a file. If a name is given
for a variable, the variable names are checked from latest to earliest, so that
if variable names are duplicated in nested files, the most internal files will
refer to their own variables (for example, simple counters like "i" or "a"); names defined in a preceding
file will be recognized only if not defined later. The number of private variables is determined
by the largest unnamed variable, so if you want to refer to some variables by
their number (for simple counters), it is wise to reserve the lowest numbers
for those, and start naming from variable 10 or more (this keeps
"memory management" easy and clean). You can also keep more variables
than the named ones as "private," simply by issuing a private
n; command (where n is the number of private variables) at some point
before calling the next input file. At the start of each file, you can also
include a safe vars n; statement (which checks
whether n user variables are
actually available to the user from that file, and issues an error message
otherwise).
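A minimal sketch of the intended usage (the file name "mysub.run" is hypothetical):
private 10 ;
run mysub.run ;
and, at the top of "mysub.run",
safe vars 20 ;
var = 0 counter ;
Here 10 variables of the caller are reserved as private, variable 0 inside "mysub.run" refers to a variable above the caller's private ones, and safe vars aborts with an error message if fewer than 20 variables are available to that file.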
In this way, each input file can have its own internal
numbering of the variables; this makes it possible to define "functions" in files that can safely be called from other files. The only
way in which a file can access a variable named in a previously opened file is
by referring to it by its name (and only if the same name does not also exist in the
file itself: in that case, the variable corresponding to the file itself is
accessed). When an input file is closed, all the variables that had been named
in that file are denamed; if you want to rename variables internally
within a file, you must first issue a var-;
command, which denames all non-private variables (var-N; denames
all variables above N, N included).
Changing values for arrays.- The setarray command
sets arrays. This can be used to read in
matrices (of any dimension):
setarray 5,4 matrix
   0 1 2 3
   4 5 6 7
   8 9 10 11
   12 13 14 15
   16 17 18 19
;
this will set the values for a variable called
matrix. The numbers can be any valid
expression; note that 0/1 matrices must have a blank space between the numbers.
Naming user variables.- User variables can be named, and accessed
through their name:
var = 0 tree_number
      1 result
      2 other ;
If private variables
have been declared in previous files, the numbers (which can themselves be expressions)
correspond to the internal numbering for current file. If any number is
replaced by a + this is interpreted as "the lowest unnamed variable"
(if there are private variables, the first time + is used within a file, it
refers to the lowest possible number that can be accessed within that file). If
you are going to store values starting from variable number N, you can define
the number of variables to store by using a comma and the number of cells,
after the number:
var =
N, (ntax+1) Taxon_list
+, (ntrees+1) Tree_list
+, (nchar+1) Char_list
+,(ntax+1),(nchar+1) Matrix
;
After defining the
name for variable N (say, N = 10) as "Taxon_list", the program
will consider that the next variable to be named is variable number 10+ntax+1,
so that the variables do not overlap. If
there are 12 taxa and 15 trees, then the name "Tree_list"
is assigned to variable number 22, and the name "Char_list"
is assigned to variable 37. Since it is
always possible to refer to variables by their names alone, the user does not
need to worry about which particular variable a name has been assigned to.
An alternative (equivalent) format for
declaration of multidimensional arrays is:
var =
   N Taxon_list [ ( ntax + 1 ) ]
   + Tree_list [ ( ntrees + 1 ) ]
   + Char_list [ ( nchar + 1 ) ]
   + Matrix [ ( ntax + 1 ) , ( nchar + 1 ) ]
;
When using the square-bracket notation
for declaration, the commas are optional.
Declaring the variables as multidimensional automatically
sets up the arrays in a proper way; changing the value for the arrays can
"disassemble" the matrix and make further accesses difficult or
impossible. To prevent this, it is
possible to have the macros check whether all accesses to arrays are legal
(access is then permitted only if the variable has been declared as an array). Without protection, the user is free to
access and change any values (e.g. to set up his own arrays). The protection level is set to N with macro
prot N.
Floating point printing.- By default, all the
numbers in user variables are stored as floating-point. When printing the values, you can use
different numbers of significant digits.
Note that many operations under scripting involve (internally) expanding
the numbers and then parsing the strings, so that changing the number of
significant digits will change the precision with which many routines are
executed.
The command macfloat N;
sets the number of digits to be N (if
0, the printed numbers look like integers).
The option macfloat e; uses exponential
notation (not readable by commands, useful only to print out tables).
Storing the values as floating point
uses 64 bits per value; if you want to use only 32 bits (integer calculations,
to save memory), issue a macfloat-;
command before turning the macros on.
Internal variables.- The user
can access a number of internal variables (i.e. number of taxa, characters,
etc.). You can get a list of those by
entering help+; at the command prompt. Internal variables can be accessed only in
certain contexts in the macro language.
The general rule is that user variables can be accessed from any command
(by enclosing their number or name in single quotes), while internal variables
can be accessed only by the macro commands (loop, if,
set,
var, etc.).
Note.- The internal variables are
a key component of the TNT macro language; if accessing other internal
variables, besides the ones already included, could be useful to some users,
get in touch with PAG and ask him about the possibility of including those
variables in TNT (given the design of the macro language, it is really easy to
add new internal variables).
Flow control.- The flow control is established with the if
and loop
commands. There are two possible
syntaxes for the if command; one is:
if ( condition )
action(s)
end
where "condition" is any valid expression,
and "action(s)" is any set of valid TNT commands. If the condition is "satisfied"
(i.e. evaluates to non-zero), then action(s) are executed; otherwise, all
actions before the end are skipped (note that TNT interprets its scripts as it
reads them, so that the action(s) may themselves contain syntax errors, which
won't be discovered if the action(s) are skipped; the errors will only be
discovered when the actions are executed).
The alternative is
if ( condition )
   action(s) A ;
else
   action(s) B ;
end
which will execute action(s) A if the condition is
satisfied, action(s) B otherwise. Each if
must always be matched by else or end; each else
must always be matched by end. Nested if commands are legal,
without restrictions as to the number of nestings.
The other command for flow control is
loop. The syntax is,
loop i+j k action(s)
stop
the numbers i, j, and k
can be numbers, variables, or expressions. This will repeatedly execute
action(s), ( k - i + j ) / j times. If the symbol '#' followed by the number of the
loop (see below) is included in action(s), it is replaced by the value for the
iteration being done. Thus, the symbol
will the first time be replaced by i, incrementing in
subsequent iterations by j, until the value exceeds k (if i
is greater than k, the loop is decreasing, and j is taken to be negative, even
if preceded by +).
Nested loops are possible (the maximum
nesting is determined with macro*). To refer to the value for each loop, use the
'#' followed by the corresponding number.
Each loop must always be matched by a stop. Within the loop, the commands endloop, setloop n,
and continue,
are recognized (endloop ends the loop, setloop resets the value of the loop to
n, and continue skips all the
action(s) between the current point and the stop that ends the loop).
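For example, the following sketch (using the quote command, which prints a text string) reports each iteration value:
loop 1+2 9
   quote now at iteration #1 ;
stop
This executes the action five times, with #1 taking the values 1, 3, 5, 7, and 9 (that is, ( k - i + j ) / j = ( 9 - 1 + 2 ) / 2 = 5 iterations).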
Essentially any action can be done from
within the loops, except denaming user variables. The loops use RAM from the macros, so if some
long loop cannot be executed under certain conditions, switch the macro option
to off and increase the RAM for macros.
Return values.- When a file with instructions is closed, it
may be desirable for it to have some kind of return value (having it store
the exit value in a user variable may be cumbersome and makes the file less
flexible). The command return
N; closes the input file and sets the internal variable exstatus to the value N.
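A sketch of the intended usage (the file name "checktree.run" is hypothetical, and is assumed to end with either return 0; or return 1;):
run checktree.run ;
if ( exstatus == 1 )
   quote the check succeeded ;
end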
Recursive calls.- Input files with
instructions can be called recursively.
To call a file from itself, use the recurse
command (followed, if desired, by arguments, just like used for the run
command). This (together with the
declaration of private variables) makes it possible to define quite complex routines. The number of times a file can call itself
with recurse depends on the
maximum number of input files (default = 10); the number of input files allowed
is changed with mxproc (the file "exactsol.run" does this change automatically). An
example is given below, in the files "exactsol.run"
and "addtax.run", which find an exact solution for any number
of taxa (using a complete down-pass optimization for each partial tree, so it
is rather inefficient). File "exactsol.run" must contain:
silent =all ;
set 0 ntax + 2 ;
mxproc '0' ;          /* make sure you have enough input calls */
var = 0 attax
      1 bound
      2 wkdone ;      /* attax, bound, and wkdone are private */
set bound 10000000 ;  /* use a big initial bound */
keep 0 ;
tread ( 0 ( 1 2 ) ) ; /* set an initial network */
set attax 3 ;         /* attax = next taxon to add */
report- ;
addtax ;              /* this is the "recursive" function */
progress/ ;
report= ;
tc 1. ;
silent - all ;
proc/ ;
File "addtax.run" must contain:
set 0 'attax' ;
set 1 ntax + 1 ;
private 2 ;
loop 1 nnodes [ 0 ]
   if ( #1 == '1' || !isintree [ 0 #1 ] )
      continue ;
   end
   if ( 'attax' == 4 )
      set wkdone ++ ;
      progress ( ( 'wkdone' * 100 ) / 15 ) 100 Working ; /* progress */
   end
   edit ] 0 '1' 'attax' #1 ;  /* attach new taxon */
   set 2 length [ 0 ] ;
   if ( '2' <= 'bound' )
      if ( 'attax' < ntax )
         set attax ++ ;
         recurse ;            /* call yourself !! */
         set attax -- ;
      else
         if ( '2' < 'bound' ) /* update bound */
            keep 1 ;
            set bound '2' ;
         end
         if ( '2' == 'bound' && ( ( ntrees + 1 ) < maxtrees ) )
            copytree 0 ;      /* save multiple trees */
         end
      end
   end
   pruntax 0 / 'attax' ;      /* detach taxon */
stop
proc/;
Note that the C
style comments in the two example files are perfectly valid.
Goto.- The instruction goto
filename tag [arguments] parses file filename starting from the point labeled tag. The file can be the same file, or a different
one. This can be used to keep related
routines in the same file (a sort of "library"). A file can be defined as the default file to
parse, with goto = filename; subsequent calls to goto then need
specify only the tag (and arguments,
if any).
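A sketch (the file name and tag names are hypothetical):
goto = mylibrary.run ;
goto cleanup ;
goto report 5 ;
where "mylibrary.run" contains, at the appropriate points, the lines label cleanup and label report (the second call passes 5 as an argument, accessed within the routine like any other argument).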
Branch-swapping.- In a way similar to the loop command, the
commands sprit and tbrit initiate a
loop, with the difference that in each iteration the tree given as argument N
is changed (under SPR or TBR branch-swapping):
sprit N action(s)
stop
Action(s) can be any
valid TNT command, plus additional commands recognized only from within sprit
or tbrit:
continue    skip remaining actions and do the next rearrangement
endswap     end the swapping (if never saved, tree N is unmodified)
resetswap   save current rearrangement and restart swapping
While within a sprit
or tbrit loop, the internal
variable percswap equals the
percentage of rearrangements done to complete swapping on the current tree. Together with the progress command, this can
be used to report partial progress of the macro instructions.
Sprit or tbrit can be
nested with loops, but no sprit or tbrit
can be nested within a sprit or tbrit.
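As a sketch (in the spirit of the macswap example mentioned earlier), the following saves every improvement found while SPR-swapping tree 0:
var = 0 bestlen ;
set bestlen length [ 0 ] ;
sprit 0
   if ( length [ 0 ] < 'bestlen' )
      set bestlen length [ 0 ] ;
      resetswap ;
   end
stop
Each rearrangement of tree 0 is scored with length [ 0 ]; when a shorter tree is found, resetswap saves it and restarts the swapping from it.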
Defining dialogs
(Windows only).- It is possible to define templates for dialog
boxes, with the command opendlg. Names for files can be written onto variables
by using the getfname command. See
online help for syntax; the file "dialog.run"
(included in the self-extracting file zipdruns.exe) contains an example, which
emulates the dialog box for Analyze/TraditionalSearch.
A more elaborate script, "bremer.run",
allows calculating Bremer supports using various
options.
This section provides citations for
the sources of some methods used in TNT.
If you use the method in a published paper, you should (besides citing
TNT itself) cite the proper source for the method.
Bremer, K. 1990.
Combinable component consensus.
Cladistics 6:369-372 (as one might guess from the title, this paper
describes the combinable component consensus).
Bremer, K. 1994.
Branch support and tree stability.
Cladistics 10:295-304 (a formal description of the "Bremer support").
Coddington, J., and N. Scharff.
1994. Problems with zero-length
branches. Cladistics 10:415-423 (a
discussion of the problems associated with tree-collapsing, and proposal of
"rule 4").
De Laet, J. 1997. A reconsideration of three-item analysis, the
use of implied weights in cladistics, and a practical application in Gentianaceae. Ph.D.
Thesis, Katholieke Universiteit
Leuven, Faculteit der Wetenschappen,
Leuven, Belgium, 214 pp. (this contains discussion of implied weighting in
general, and it discusses the idea of recoding additive characters in binary form
for implied weighting).
Farris, J., V. Albert, M. Källersjö,
D. Lipscomb, and A. Kluge. 1996.
Parsimony jackknifing outperforms neighbor joining.
Cladistics 12:99-124 (this paper describes the idea of resampling
characters with independent probability of removal, to eliminate influence of
uninformative or irrelevant characters, as well as the idea of superficial
searches for consensus estimations).
Farris, J. 1970.
Methods for computing Wagner trees.
Syst. Zool. 19:83-92 (the paper describing the first formal approach to
Wagner trees and optimization for additive and continuous characters).
Farris, J. 1990.
Phenetics in camouflage. Cladistics 6:91-100 (a discussion of problems
associated with some methods to discretize continuous characters in phylogenetic
analysis).
Felsenstein, J. 1985.
Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791 (the original paper
proposing to use the bootstrap on phylogenies).
Fitch, W. 1971.
Toward defining the course of evolution: minimal change for a specific
tree topology. Syst. Zool. 20:406-416
(this paper describes optimization for non-additive characters).
Goloboff, P. 1993.
Estimating character weights during tree search. Cladistics 9:83-91 (description of the method
of "implied weights").
Goloboff, P. 1994.
Character optimization and calculation of tree lengths. Cladistics 9:433-436 (description of the
basic algorithm for faster calculation of tree lengths during searches; instead
of making T-2 node comparisons for each tree of T taxa, the algorithm requires
making only one node comparison per tree).
Goloboff, P. 1996.
Methods for faster parsimony analysis. Cladistics 12:199-220 (this paper
describes the "two pass incremental optimization" which is the basis
for calculations in TNT. It also
describes the "qcollapse" algorithm for
pre-collapsing of trees during searches).
Goloboff, P. 1998.
Tree searches under Sankoff parsimony.
Cladistics 14:229-237 (this paper describes calculation of tree lengths for
Sankoff characters; the method used in TNT is a combination of the exact method
described in this paper, with the incremental optimization described in
Goloboff, 1996).
Goloboff, P. 1999. Analyzing
large data sets in reasonable times: solutions for composite optima. Cladistics
15:415-428 (description of tree-drifting, sectorial searches, tree-fusing, and
the idea of searching until the consensus stabilizes).
Goloboff, P. 2002. Optimization of polytomies: state
set and parallel operations. Molecular Phylogenetics and Evolution 22:269-275
(a description of the method used in TNT to optimize polytomies for
non-additive characters).
Goloboff, P. 2007.
Calculating SPR-distances between trees.
Cladistics (in press).
Goloboff, P., C. Mattoni,
and S. Quinteros. 2005. Continuous characters analyzed as such.
Cladistics 22:589-601 (a description of the
method used to optimize continuous characters on polytomous trees, and a
description of the implementation of continuous
characters in TNT; see also Farris, 1970).
Goloboff, P., and D. Pol. 2002. Semi-strict
supertrees. Cladistics 18:514-525 (description of the supertree method used in
TNT, and some cautionary remarks about the use and interpretation of
supertrees).
Goloboff, P., and J. Farris. 2001. Methods for
quick consensus estimation. Cladistics 17:S26-S34 (description of the methods
to estimate consensuses, the relative bremer support,
and tree collapsing using branch swapping).
Goloboff, P., J. Farris, M. Källersjö,
B. Oxelmann, M. Ramirez, and C. Szumik. 2003. Improvements to resampling measures of group
support. Cladistics 19:324-332 (description of symmetric resampling, frequency
differences, and frequency slopes).
Maddison, W. 1989. Reconstructing character evolution on
polytomous cladograms. Cladistics 5: 365-377. (the
paper describing the method used to calculate length of "soft"
polytomies).
Nixon, K. 1999.
The parsimony ratchet, a new method for rapid parsimony analysis.
Cladistics 15:407-414 (the original description of the ratchet, already a
citation classic).
Hendy, M., and D. Penny. 1982. Branch and bound algorithms to determine
minimal evolutionary trees. Mathematical Biosciences 59:277-290 (description of
the implicit enumeration algorithm, used for solutions guaranteed to find all
trees of best score).