Monolithic pipelines are common in bioinformatics and particularly for metabarcoding. My view is that the word pipeline, and the type of software it refers to, may be holding us back and should be rethought.
What is a pipeline?
Pipelines are connected sets of programs, where information flows through the linked analysis algorithms as water flows through pipes. Some parts trim DNA sequence data, some filter, some clip and some demultiplex. Others determine taxonomic identity by comparison to a reference database. At the end there is a result, often an estimated taxonomic ID for each sequence. Many decisions are being made about the optimal way to analyse these data, each associated with a distinct step in the analysis, and many of those decisions are not being made by me. My problem with pipelines is the way they persuade me to abandon being a rigorous scientist and take the easy solution. Why should I accept that your pipeline is one thing, rather than lots of things, each to be optimised for my specific conditions?
Pipelines should not be our focus
The issue here though is that the pipeline is really a summary of all the analysis steps it contains, and its output a summary of theirs. Yet your pipeline is a thing: it's a publication, it has a name, it's arranged to be used as it is, it's take-it-or-leave-it. When people describe how they analysed their data they say “I used BlahPipe, it told me A and B were different”.
It’s analogous to comparing the means of several complex data sets: not wrong, but it misses the distribution of the data, and it misses out on experimental evaluation of our analysis. Each program in the pipeline has default parameters for filtering, clustering and identifying the sequences. These are well tested and reasonable in many situations. But how often are they optimised for the dataset being analysed? Rarely.
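To make that point concrete, here is a toy sketch of treating one parameter experimentally rather than accepting a default: sweep a length cutoff and count how many sequences survive each setting. The sequences, the filename and the cutoff values are all invented for illustration; a real run would sweep the real program's parameter instead.

```shell
# Toy parameter sweep: instead of accepting one default length cutoff,
# try several and see how many sequences survive each.
# Sequences and cutoffs are invented for the example.
printf 'ACGTACGT\nACGT\nACGTACGTACGT\n' > sweep_reads.txt

for cutoff in 4 8 12; do
    # Count lines at least $cutoff bases long (awk stands in for a real filter).
    kept=$(awk -v c="$cutoff" 'length($0) >= c {n++} END {print n+0}' sweep_reads.txt)
    echo "cutoff=$cutoff kept=$kept"
done
```

Plotting survivors against the cutoff, for your own data, is exactly the kind of small experiment a monolithic pipeline discourages.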
My specific complaint is that the way we currently use pipelines de-emphasises the individual programs, makes optimisation of their parameters difficult, and prevents us from taking an experimental approach to data analysis.
Do one thing, and do it well
The Unix philosophy of small programs that can be chained together, each doing one thing and doing it well, fits beautifully with bioinformatics data analysis. In the Unix data-analysis world, however, the command-line instructions are typically small, clear and accessible.
The Unix philosophy of DOTADIW (“do one thing and do it well”) can sound quite religious to me at times, but it is useful. It’s rightly criticised on several counts; proponents do not really define what “one thing” means, nor “well”. The logical conclusion is not that we should have an infinite number of “programs” doing infinitely small tasks; rather, for me it’s about functional modularity. It doesn’t matter that cat can both concatenate files and display them, just that its typical use is to do one thing. It then leaves the next substantial task for the next independent element of the workflow. Similarly, it doesn’t matter to me that seqkit can both filter and rewrite the format of files. I don’t care that it can do multiple jobs; each action is a single swappable job in a workflow. If I want CD-HIT to deduplicate sequences before seqkit converts them, that is trivial to change in my workflow. It’s not ‘use this software for all-or-nothing’; rather, each program does one thing, then the same or a different program does the next thing, then the next. I don’t much care about definitions in the philosophy, I just want it to work flexibly, without other people making the decisions for me.
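That swappability can be sketched as a plain shell workflow. This is a toy illustration using standard Unix tools as stand-ins (sort -u in place of a deduplicator such as CD-HIT, awk in place of a length filter such as seqkit's); the filenames, the one-sequence-per-line format and the eight-base cutoff are all invented for the example.

```shell
# A minimal sketch of a modular workflow: each step reads a file,
# does one job, and writes a file the next step can pick up.

# Create a toy dataset: one sequence per line (simplified from FASTA).
printf 'ACGTACGT\nACGTACGT\nACGT\nTTTTACGTACGT\n' > reads.txt

# Step 1: deduplicate identical sequences (a stand-in for CD-HIT).
sort -u reads.txt > dedup.txt

# Step 2: drop sequences shorter than 8 bases (a stand-in for a seqkit filter).
awk 'length($0) >= 8' dedup.txt > filtered.txt

cat filtered.txt
```

Reordering or replacing a step is a one-line change, and no other step needs to know: that is the whole argument for functional modularity.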
This idea of functional modularity, with programs passing information along seamlessly, is of course also a key part of the Unix philosophy; it’s just not repeated as often as DOTADIW.
Regrets, I’ve had a few
I’ve learned a lot by doing things wrong. Let’s leave it at that, shall we?
Let’s use workflows instead
I’m now thinking more of workflows than pipelines.
The very word pipeline implies a fixed route for the contents: oil flows from the pump to the refinery, and nowhere else. That is a bad metaphor for a complex and ever-changing scientific data-analysis optimisation.
A workflow, however, is more complex, less one-dimensional, and encourages consideration of its components. Workflows differ in that there is an implicit flexibility, an implicit need to design the workflow. Using the word ‘workflow’ can start a quiet revolution against black-box software.
Is this just semantics?
Yes, but that can be useful and important, not trivial. I think that use of the very word ‘pipeline’ has subtly moved bioinformatics science in the wrong direction. Recognising this is an easy fix.
I’m now trying to use the word ‘workflow’ and not ‘pipeline’, and I’m trying to be much more suspicious of software that doesn’t make it easy for me to tinker and optimise, as its authors probably don’t understand the science I need to do. Neither do I; that’s why I need to explore rather than move my data along a one-dimensional pipeline from crude to ‘refined’ oil/data.