Frictionless reproducibility in phylogenomic experiments

AmirSzitenbergThis is a guest post by Amir Szitenberg, postdoctoral researcher in my group at the University of Hull, and main author of @ReproPhylo

I find the ReproPhylo approach to experimental phylogenomics very exciting, and can see how it would lead to better, in depth understanding of phylogenomic datasets, regardless of their size. An example for this is described in mine and Dave’s preprint, written together with Max John and Mark Blaxter. As for the implementation of reproducibility tools in ReproPhylo, they are meant to be completely the opposite: quiet and unexciting. Something that allows us to focus on the good stuff. However, since this is my first ever blog post on ReproPhylo, I will focus here on reproducibility aspects of the programme. The ReproPhylo environment has reproducibility features that are built into the python module, others that are the benefit of ReproPhylo’s Git integration, and an extra layer of reproducibility is gained by distributing ReproPhylo as a Docker image.

ReproPhylo on its own (well, with its python module dependencies)…

Any pipeline that uses scripts is a step forward toward reproducibility. However, in addition to having a concise scripting syntax, ReproPhylo achieves several additional goals.

Explicit provenance

Provenance (information about the data inheritance chain that produced a certain result) is a tough one. It is very easy to produce a very nice tree, but then have doubts regarding which version of the sequence alignment produced it. ReproPhylo circumvents the problem. A single item, the Project class object, contains all the datasets, including inputs, intermediate and outputs, with IDs associating them with the process that produced them. If you have this object in hand, you have provenance. You do not have to match the tree file to the alignment, or to the sequence version used in it. Reporting methods will make the relationships among the Project components clear to the user.

Persistence, repeatability, and extendibility

The Project object is automatically and continuously pickled. ReproPhylo uses the cloud python module, with its ‘pickling’ functionality, to save the Project object as a binary file. ReproPhylo will update the pickle file whenever an action is taken (eg, sequences are aligned). This file secures the persistence of the analysis as it can always be read again to review or continue the analysis, or to access the data in it. ReproPhylo utilizes commonly used phylogenetic analysis programmes (see manual) that can be easily and flexibly controlled with built in functions. However, since data are always maintained as standard Biopython and ETE classes (SeqRecord, MultipleSeqAlignment and Tree) they can be accessed without ReproPhylo, and can be directly plugged into pipelines utilizing these modules, thus releasing/ reading data to or from any programme that is not yet integrated. The original data objects, nested within the Project, can be tweaked or utilized in place, or instead, there are Project methods to produce copies to work with, keeping the original as is. Interfacing without dependency on Biopython and ETE can be achieved by Project methods that produce or read text files in any of the formats compatible with Biopython and ETE.

Figures and tables

ReproPhylo stores the metadata of the sequences, as read from GenBank files or CSVs. As the metadata appears there, so it will on your trees. There is reduced opportunity for human error in the transference of metadata from data bins to trees or built in analyses and steps (e.g. trait matrix for Bayestraits), as long as it was fed in correctly initially. Species names will be the same as in the original GenBank record, or as set in the metadata spreadsheet, no matter if you decide to add them to your figure post hoc. Need to change a species name or some other metadata? Change your spreadsheet, the change will propagate to your figures.

Reproducible publications

ReproPhylo, upon request will produce an archive file containing everything needed for publishing the analysis. It will contain the sequences and metadata as a GenBank file, the trees and alignments as a PhyloXML file, the figures, and a report providing detailed descriptions of the methods and data composition.

With Git integration there’s also…

Provenance integrity – the final nail

Git works quietly at the background. ReproPhylo will record a version of the pickle file any time it is updated, as well as of input files, scripts and Jupyter notebook files. Built in Project methods allow to view the Git commits and to revert the Project to other versions than the current. Toggling between, say, tree versions in this way, cannot damage provenance information, because the whole Project containing the tree will be toggled along with it, preserving prevenance.

Symplified publication of complex pipelines

A consequence of using Git to allow toggling between Project versions is the creation of a Git repository ( a .git directory). ReproPhylo makes sure not to interfere with pre-existing Git repositories and ones that do not explicitly belong to the current Project. This Git repository is all you would need to publish your workflow as it can be pushed to GitHub and then given a FigShare DOI, using FigShare’s Git integration. This way of publishing your analysis is a more direct and cuts out the middleman which is supplementary files.

Working with Docker (or other virtualization solutions)

Unlike Git, Docker is not integrated in ReproPhylo, just used as a method of distribution. However, the combination of the ReproPhylo Docker image and the Git repository produced by your analysis are as close as one can get to ultimate reproducibility, as far as I can see. The Docker container is the environment in which the analysis was done, and all the challenges of recreating this environment do not exist anymore. This combination of a Git repository and a Docker container, eliminates hours of setting up a high quality reproducible publication and of installation and configuration for a reader interested in repeating and/or extending the analysis, and makes them a thing of the past. Since ETE and matplotlib communicate with the host OS X11-server to produce graphics, some steps are required to couple the X11-server in the docker container with that of the host OS. This may seem to stand in the way of a slick installation process, but solution to this is simple in Linux OSs, and comes in the form of a shell script that manages all the steps starting with the image pulling and ending with serving Jupyter notebook to the local default web browser. However, Docker is not the ultimate solution on all platforms. In OSX and WIndows, there is no simple solution, since Docker operates as a virtual machine in these operating systems. For Windows, I solved this by forfeiting containerization. Instead, ReproPhylo is distributed as a WinPython self contained version. Just download, extract, and fire-up the Jupyter notebook. In OSX, I lean towards the other direction, replacing the Docker image with a full scale Ubuntu VM image (for full installation details see the manual). I would have loved to have a single distribution which is truly cross-platform and seamlessly installs on any machine, but this doesn’t seem likely to happen in the near future.


The tools for easy and reliable reproducibility exist. It is putting them together and configuring them for our needs, as ReproPhylo attempts to do, which might take some time. However, the time put into it is undoubtedly regained when these tools are routinely taken advantage of.

ReproPhylo webpageReproPhylo githubReproPhylo manual

@ReproPhylo twitter

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s