How to visualize a phylogeny with thousands of tips?

What abilities should a phylogenetic visualisation tool have? What is important when you have so many tips (OTUs) that the tree is too big to print out or even scroll through on the screen? Several of my research projects fall into this last category. In no particular order, here are some things that seem important to me:

  1. It should still be “snappy” when dealing with tens of thousands of OTUs. I think it should be a standalone application, not web-based, for tasks like this.
  2. It should be open-source with an active development community. Can we really keep relying on single program authors for development? No.
  3. It must interact with an associated data file. This data file can be common to a number of trees. It could be parsed from GenBank and keep ALL field data plus user data. This data file is essential for data-driven OTU renaming, searching, collapsing and exporting.
  4. It should collapse OTUs into groups defined in the associated data file, and name these groups, i.e. automatically group OTUs into “mammalia”, “rotifera”, “arthropoda”, “diptera”. Collapse and name options could be parsed from the GenBank taxonomy. See GRUNT.
  5. It should be able to collapse nodes automatically to form polytomies. These could be clades below a given support value, or below a certain branch length.
  6. It should be able to reroot, either user-defined (by clicking on an OTU or clade) or by midpoint rooting (the default).
  7. It should be able to test for monophyly of groups. It could colour these groups accordingly. So if all descendant taxa of a node are called mammalia in the taxonomy file then the group is labeled “mammalia”. If another mammal is found outside of the mammalia clade then the group is flagged as non-monophyletic.
  8. It should let you see both the details and the whole picture. At the least, click to zoom in and out. So maybe an inset showing where in the tree you are, with a clickable interface to go somewhere else, is vital. See Rod Page’s ideas on visualisation of large trees on a web page.
  9. It needs to have search facilities. These should be able to search both the tree and the associated data files, with Boolean queries: find this text string in these fields AND that one in those.
  10. User-definable tip names. It should be easy to switch between different tip names (taken from the data file), such as accession number, species name, etc. It should be possible to apply rules to this: if this and that, then name the tip like this.
  11. It must be able to export reliably, in all tree formats, with appropriately considered tip names etc. Graphics export should support SVG, PDF, EMF etc. Exported graphics must be available in collapsed format too.
  12. It should be scriptable. It’s very useful for a viewer to be able to slot into a bioinformatics pipeline. So “program open treefile, collapse according to this datafile and criteria, rename tips according to this, export as SVG”.
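
To make points 5 and 7 a little more concrete, here is a minimal Python sketch of collapsing weakly supported nodes into polytomies and testing a group for monophyly on a rooted tree. Everything in it is an assumption for illustration only: the `Node` class, the function names, the toy tree and its support values are all made up, and none of it is taken from an existing tree viewer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical minimal tree node, not from any existing package."""
    name: str = ""
    support: Optional[float] = None  # support for the branch above this node
    children: List["Node"] = field(default_factory=list)

    def is_tip(self) -> bool:
        return not self.children

    def tips(self) -> List[str]:
        if self.is_tip():
            return [self.name]
        return [t for child in self.children for t in child.tips()]


def collapse_low_support(node: Node, threshold: float) -> Node:
    """Turn internal nodes with support below `threshold` into polytomies."""
    new_children = []
    for child in node.children:
        collapse_low_support(child, threshold)
        weak = (not child.is_tip()
                and child.support is not None
                and child.support < threshold)
        if weak:
            new_children.extend(child.children)  # splice grandchildren upwards
        else:
            new_children.append(child)
    node.children = new_children
    return node


def is_monophyletic(root: Node, group_of: Dict[str, str], group: str) -> bool:
    """True if some node's tips are exactly the members of `group` (rooted tree)."""
    members = {t for t in root.tips() if group_of.get(t) == group}
    if not members:
        return False

    def search(node: Node) -> bool:
        if set(node.tips()) == members:
            return True
        # Only descend into children that still contain every member.
        return any(search(c) for c in node.children
                   if members <= set(c.tips()))

    return search(root)


# Toy example: ((human,mouse)95,(bat,(fly,beetle)40)80), all values invented.
tree = Node(children=[
    Node(support=95, children=[Node("human"), Node("mouse")]),
    Node(support=80, children=[
        Node("bat"),
        Node(support=40, children=[Node("fly"), Node("beetle")]),
    ]),
])
groups = {"human": "mammalia", "mouse": "mammalia", "bat": "mammalia",
          "fly": "arthropoda", "beetle": "arthropoda"}

for g in ("mammalia", "arthropoda"):
    print(g, "monophyletic" if is_monophyletic(tree, groups, g) else "NOT monophyletic")

collapse_low_support(tree, threshold=70)
print([len(c.children) for c in tree.children])  # child counts after collapsing
```

The monophyly search recomputes tip sets at every node, which is fine for a toy tree but would want replacing with a single post-order pass before it could be called “snappy” on tens of thousands of OTUs.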

Am I asking a lot? Not really. All of this can be implemented with current code; people just don’t, in general. Any suggestions for more? Anything you don’t agree with?

Many programs claim to deal with hundreds or thousands of tips on a tree. My cichlid mtDNA tree has approximately 4000 OTUs. The NJ tree would, if printed out, fill more than 40 pages. There are several programs that can deal with this and feel reasonably fast, but it is almost impossible to get a meaningful look at the phylogenetic relationships. With too much data on the screen, I can’t see the wood for the trees. It is essential to be able to collapse down the hundreds of almost identical mtDNA sequences coming from Lake Victoria fish and just label the resulting triangle “Victoria Superflock”. Immediately I can start to see their relationship to others without an enormous amount of scrolling. The datafile would allow me to have this done across the whole tree using taxonomic names. Imagine a big tree of birds presorted into orders, and labeled accordingly! Immediately you would be able to see what’s going on and begin the actual biological interpretation of your data.
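
The data-driven collapsing described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tree is a plain nested list of tip names, the `group_of` dictionary stands in for the associated data file, and the tip names, the function names and the `n=` label style are all invented here. A clade is replaced by a single labelled tip whenever every tip beneath it carries the same group name; tips missing from the data file are never collapsed.

```python
def collapse_by_group(tree, group_of):
    """Return (collapsed_tree, shared_group_or_None, number_of_tips)."""
    if isinstance(tree, str):                      # a single tip
        return tree, group_of.get(tree), 1
    parts = [collapse_by_group(sub, group_of) for sub in tree]
    labels = {label for _, label, _ in parts}
    n = sum(count for _, _, count in parts)
    if len(labels) == 1 and None not in labels:    # whole clade is one group
        label = next(iter(labels))
        return f"{label} n={n}", label, n
    return [rendered for rendered, _, _ in parts], None, n


def quote(label):
    """Quote a Newick label unless it is plain letters, digits and underscores."""
    if label.replace("_", "").isalnum():
        return label
    return "'" + label.replace("'", "''") + "'"


def to_newick(tree):
    """Write the nested-list tree as a topology-only Newick string."""
    def fmt(t):
        if isinstance(t, str):
            return quote(t)
        return "(" + ",".join(fmt(sub) for sub in t) + ")"
    return fmt(tree) + ";"


# Invented tip names; in practice the groups would come from the data file.
tree = [["Hap_vic1", "Hap_vic2", ["Hap_vic3", "Hap_vic4"]],
        ["Tropheus_1", ["Oreo_1", "Oreo_2"]]]
group_of = {"Hap_vic1": "Victoria Superflock", "Hap_vic2": "Victoria Superflock",
            "Hap_vic3": "Victoria Superflock", "Hap_vic4": "Victoria Superflock",
            "Oreo_1": "Oreochromis", "Oreo_2": "Oreochromis"}

collapsed, _, _ = collapse_by_group(tree, group_of)
print(to_newick(collapsed))
# ('Victoria Superflock n=4',(Tropheus_1,'Oreochromis n=2'));
```

A real viewer would also need to carry branch lengths through and draw the collapsed clade as a triangle; this sketch only covers the grouping and naming.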

There are 2 or 3 programs I am aware of that (almost) do all the above. In other posts I will discuss them, and how I’m currently using them for large scale phylogenetics and informatics. My favourites at the moment are ARB and Treedyn. There is a list of tree viewers at the Treedyn site that seems quite good, perhaps getting a little old now though.

I’ll describe my thoughts on current software, pros and cons, and “the future” in an upcoming posting.

4 Comments

  1. Hilmar says:

    Dave – have you also looked at Dendroscope (http://www-ab.informatik.uni-tuebingen.de/software/dendroscope/)? Unfortunately, it is *not* open source, but it is free to use for academics.

  2. Dave Lunt says:

    Thanks Hilmar, yes I know Dendroscope. I quite like it; it’s certainly one of the best “standard” treeviewers out there. It’s nicely made and some parts are quite forward looking. But it doesn’t quite do it for me. Imagine that I have a tree made in a standard program, where the names of the OTUs are constrained to 10 characters, as in some old-fashioned (PHYLIP-like) packages! Even if they are more relaxed than this, and my names can be “Genus species accession_number”, what if I want to see the distribution of short sequences on the tree, or which sequences were generated by which published studies? If the treeviewer can interact with a database then ANYTHING in the GenBank record can be plotted on the tree (sequence length, first author, full taxonomic name, collection locality etc). Great! Many tests and future experiments immediately spring to mind when I have a phylogenetic framework and lots of data (characters) at the tips. I don’t think Dendroscope can interact with outside data sources, can it?

    One of the most useful things to do first with big trees would be to label the groups by proper taxonomic names, exactly as shown in the Dendroscope paper figure 1 (doi:10.1186/1471-2105-8-460). But that works because they have loaded up the GenBank taxonomy, which necessarily includes all the internal node names. A normal phylogeny would just have tip names. How then do you label hundreds or thousands of internal nodes?

    Well, there are Perl scripts to do these things, but my ideal treeviewer for the future would go out of its way to interact with other data sources. It’s not so very hard to do if each OTU is linked to a GenBank record with its taxonomic position recorded.

    Thanks for the comment by the way, it’s great to get into some discussions on these issues. I really like to hear what other people think. Maybe I’m way off target.

    I have a couple (maybe 3) potential solutions. I’ll write them up when I’ve finished marking these exam scripts!

  3. Mike says:

    Interacting with external data sources would be great. I think something like what you’re describing is being accomplished by Greengenes (http://greengenes.lbl.gov/cgi-bin/nph-index.cgi) and Silva (http://www.arb-silva.de/) and their links with ARB, though there is a disconnection between the databases and the phylogeny package, and accessing the sequence metadata through ARB is cumbersome. Admittedly they are prokaryote-centric. From a microbial ecology viewpoint there’s a huge advantage to using a curated database rather than GenBank, given that GenBank accession numbers aren’t really unique and the sequence quality can’t be guaranteed.

  4. Dave Lunt says:

    Thanks Mike, they are both great resources. I wish there were eukaryote-studying people doing such great work. ARB is one of my favourite applications (though it frustrates me too). I wonder if it would be possible to interact with online databases like SILVA or Greengenes over the wire, rather than the download-everything approach ARB takes. Maybe too much data, maybe not.
