Reproducible phylogenetics part 2b; what

Previously I wrote about (1) why we need reproducibility in phylogenetics and (2) what we need to achieve it. This is part 2b, which continues with what we need to achieve reproducibility. My conclusion before was:

“that most of the issues surrounding reproducible phylogenetics are solved problems in other disciplines. The things that are still challenging are not about achieving reproducibility but about achieving it easily, irrespective of computational experience, such that reproducibility becomes the default behaviour.”

Phylogenetic reproducibility is an emerging discipline, and in addition to technical issues in achieving reproducibility (discussed in the previous post) there are other challenges:

Avoiding post hoc claims of reproducibility

Credit is sometimes taken for partial or accidental reproducibility, without a clear reproducible design or test of reproducibility: “I have provided [some key data] that many studies do not usually share, therefore my work is reproducible”. But has your reproducibility been planned, considered, and optimised? Where is the test of your intended reproducibility? This post hoc claim isn’t bad; it is better than average. But we should not allow this unplanned, unevaluated reproducibility to become default behaviour.

Reproducibility and Reusability

We need our science to be reproducible, but also reusable. Beyond making it possible to reproduce the work, we need to make the work easy to reproduce, easy to modify, and easy to reanalyse in different ways. If we are going to do reproducible phylogenetics we should make it useful, i.e. reusable by other scientists, else we have achieved little.

What do we need to make phylogenetics reusable?

tl;dr We need to be able to reproduce an analysis in a reasonable time frame with no more than a reasonable amount of effort. Standard data formats, the original software and its dependencies, and the analysis instructions are best wrapped into a connected set of instructions called a “pipeline”.

Pipelines: wrap the workflow in a script

The use of analysis pipelines automatically achieves most of the things we need for reusable and reproducible phylogenetics. The pipeline can call programs in the right sequence, to analyse the correct data files, using the correct parameters, and save outputs in the correct format. All settings for the analysis should be de facto recorded, and an experimental record or log file of the entire analysis can be automatically written. This should make the replication of results a ‘one click’ task, and simple modifications of the original analysis will require only changing a parameter.
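As a minimal sketch of this idea, assuming nothing about any particular phylogenetics program, the Python script below runs a list of commands in order and writes every parameter, command, and output to a log file. The `PARAMS` dictionary, the step names, and the commands themselves are hypothetical placeholders: the commands are harmless calls to the Python interpreter standing in for an aligner and a tree builder.

```python
import json
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical settings; modifying the analysis means changing a value here.
PARAMS = {"alignment": "seqs.fasta", "model": "GTR", "bootstraps": 100}


def build_steps(params):
    """Return (name, command) pairs in the order they should run.

    The commands are stand-ins that only echo their arguments; a real
    pipeline would call an aligner, a tree builder, and so on.
    """
    echo = [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]
    return [
        ("align", echo + ["aligning", params["alignment"]]),
        ("tree", echo + ["model", params["model"],
                         "bootstraps", str(params["bootstraps"])]),
    ]


def run_pipeline(params, log_path):
    """Run every step in sequence and write a JSON record of the whole run."""
    record = {
        "started": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "steps": [],
    }
    for name, command in build_steps(params):
        result = subprocess.run(command, capture_output=True, text=True)
        record["steps"].append({
            "name": name,
            "command": command,
            "returncode": result.returncode,
            "stdout": result.stdout.strip(),
            "stderr": result.stderr.strip(),
        })
        if result.returncode != 0:
            break  # stop at the first failing step; the log still records it
    Path(log_path).write_text(json.dumps(record, indent=2))
    return record


if __name__ == "__main__":
    run_pipeline(PARAMS, "analysis_log.json")
```

Because the settings live in one place and the log is written by the script rather than by hand, rerunning the analysis is a single command, and the log doubles as the experimental record.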

Use standard data formats

I saw someone write that there would be a special place in hell for people inventing new formats for sequence data! Perhaps a bit strong, but I estimate that I have used 1 year of my working life swapping between data file formats. I think this is an accurate estimate; it’s not a joke. Often this has required using 3 separate programs, each reading and writing different variants, to get the data into the format my final application needs. Things are better now that BioPerl, Biopython, etc. process pretty much all sequence file formats, but they still rely on standards. Standard data formats are a serious matter: do not invent new ones, and do not use modifications of standard formats that work in one program only. Instead, reject that program and email the author to explain why.
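To illustrate how mechanical such conversions are when both ends are standard, here is a small sketch in plain Python that turns FASTA text into relaxed sequential PHYLIP. It is an illustration under simplifying assumptions (sequence names are cut at the first whitespace, and the input must already be an alignment), and the function names are my own; in practice a maintained library such as Biopython is the better choice.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header with no sequence name")
            # Keep only the identifier: the text up to the first whitespace.
            name, chunks = fields[0], []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_relaxed_phylip(records):
    """Format (name, sequence) pairs as relaxed sequential PHYLIP text."""
    if not records:
        raise ValueError("no sequences to write")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("PHYLIP needs an alignment: sequence lengths differ")
    width = max(len(name) for name, _ in records)
    # Header line: number of sequences, then alignment length.
    lines = [f"{len(records)} {lengths.pop()}"]
    lines += [f"{name.ljust(width)} {seq}" for name, seq in records]
    return "\n".join(lines) + "\n"
```

The converter refuses input it cannot represent faithfully (unequal sequence lengths, nameless records) rather than silently writing a variant that only one program can read.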

Provide the original software and all its dependencies

It is unfortunate but true that we must include the original software in a reproducible experiment (Morrison 2013; Lunt 2013; Bergman 2012). The reasons are both reproducibility and reusability. Software, and particularly old versions of software, goes extinct. The author leaves science, or stops having funding to maintain the software, the webpage stops working, and that software is no longer available. Journals do not archive software when they publish the manuscript, although we should push for this. Even if software did not go extinct, providing the software aids reproducibility, since there is then no question about which version was used. Lastly, distributing the software as part of the experimental package greatly increases reusability, since the workflow is bundled together with no need to scour the web for the correct versions of the analysis programs used.

Much of this, it seems, is most easily achieved by saving a virtual machine or similar. A VM is not required for reproducibility, but people will spend only a certain amount of time and effort before they give up, so a bundled version of the environment and data is very helpful (I would say essential) to real world reproducibility.
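One way to remove any question about which software was used is to ship a checksum manifest alongside the bundled programs. The sketch below is an assumption about how that could be done, not an established convention: it walks a hypothetical `software/` directory and records a SHA-256 digest for every file, so that anyone rerunning the analysis can confirm they hold byte-identical copies.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def software_manifest(software_dir):
    """Map each file under software_dir (by relative path) to its digest."""
    root = Path(software_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def write_manifest(software_dir, manifest_path):
    """Write the manifest as JSON next to the bundle and return it."""
    manifest = software_manifest(software_dir)
    Path(manifest_path).write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest
```

A call such as `write_manifest("software", "software_manifest.json")` (both names are placeholders) would then be one more step in the pipeline, run every time the analysis runs.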

What is best practice in phylogenetic data storage?

The best practice is to retain ALL information in a phylogenetic analysis. There have been a number of articles and posts suggesting the minimal or ideal information for phylogeneticists to record. Forcing the user to choose which data to retain or discard introduces friction and error into the process. Friction is the enemy of science: easy things get done, and frustrating things don’t, even if they are important. Ideally, reproducibility best practice is something that would just happen without user intervention, and omit nothing.

We could learn a lot from computer backup strategies. Lots has been written on this, and a very powerful message is: automate the backup of all your files. If you have to remember to back up, if you have to make time, if you have to choose, you won’t do it well enough or often enough. Backup of ALL data is something that should happen very regularly in the background as default behaviour in both phylogenetics experiments and life.
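In the same spirit, a pipeline can snapshot the entire analysis directory every time it runs, with no decisions left to the user. The function below is a minimal sketch using only the Python standard library; the function name, directory names, and archive naming scheme are assumptions for illustration.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(analysis_dir, backup_dir):
    """Archive everything under analysis_dir into a timestamped zip file.

    Nothing is filtered out: the user never chooses what to keep.
    Returns the path of the archive that was written.
    """
    source = Path(analysis_dir).resolve()
    backup_root = Path(backup_dir).resolve()
    # Keep backups outside the analysis, or each archive would try to
    # include the earlier archives (and itself).
    if backup_root == source or source in backup_root.parents:
        raise ValueError("backup_dir must be outside analysis_dir")
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    base_name = backup_root / f"{source.name}-{stamp}"
    return shutil.make_archive(str(base_name), "zip", root_dir=str(source))
```

Calling `snapshot("my_analysis", "backups")` as the last step of a pipeline, or from a scheduled job, makes the backup happen as default behaviour rather than as something to remember.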

What is best practice in phylogenetic data sharing?

This is slightly different from storage, discussed above. Sharing involves easy archiving in open public repositories such that users can access the analysis and reproduce it. One example is a zip file for easy upload to FigShare, Dryad, or similar. In future maybe this could happen frictionlessly, with phylogenetic scripts archiving the data directly using the repository’s APIs.

What is best practice in phylogenetic environment sharing?

The analysis may be very reproducible indeed on my computer, but if you are missing a crucial program dependency, or have a different version of Python, or differ in some other part of the environment, it may be impossible to replicate the experiment or even run the scripts. Maybe we need to store and share the computer environment in which the analysis was run. This can be done via virtual machines (VMs), which save the environment (e.g. a whole Linux system) and allow you to run it from within whatever operating system you are working in. Docker is similar to a VM but “lighter”, and doesn’t require you to install the entire operating system, just the parts required. There is a lot of excitement around container systems (like Docker) in computing at the moment, and this is not a technology likely to disappear. Currently I think best practice for reproducibility is to provide a Docker container with ALL the scripts, the data, and the computational environment needed to run them exactly as they ran locally. This environment is not static of course: the new user can update the software versions as normal, and then compare any changes to the original.
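A container captures the environment itself. A lighter complement, and not a substitute for Docker or a VM, is to have the pipeline write down what the environment was, so that a later user can see what changed after updating software. The sketch below is an assumption about how that could look for a Python-based pipeline: it records the interpreter, the platform, and the installed Python packages, and reports the package differences between two such records. The function names and the record layout are hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment():
    """Return a JSON-serialisable description of the running environment."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


def package_differences(original, current):
    """Return {package: (original_version, current_version)} where they differ.

    A version of None means the package is absent from that environment.
    """
    old, new = original["packages"], current["packages"]
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    # Print the record so it can be saved alongside the analysis outputs.
    print(json.dumps(describe_environment(), indent=2))
```

Note that this only covers Python packages; external programs such as aligners or tree builders would still need to be bundled or containerised as described above.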

Next…

In the next post I’m going to talk about some options that already exist to implement reproducible phylogenetics.

Bergman C 2012 On the Preservation of Published Bioinformatics Code on Github. Blog post https://caseybergman.wordpress.com/2012/11/08/on-the-preservation-of-published-bioinformatics-code-on-github/

Lunt DH 2013 How can we ensure the persistence of analysis software? EvoPhylo blog http://www.davelunt.net/evophylo/2013/03/software-persistence/

Morrison D 2013 Archiving of bioinformatics software. The Genealogical World of Phylogenetic Networks blog http://phylonetworks.blogspot.co.uk/2013/07/archiving-of-bioinformatics-software.html