I’ve been thinking about sustainable and accessible archiving of bioinformatics software. I’m pretty scandalized at the current state of affairs, and have complained about it before. I thought I’d post some links to other people’s ideas and talk a bit about the situation and the action that is needed right now.
Casey Bergman wrote an excellent blog post (read the comments too) and created the BioinformaticsArchive on GitHub. There is a Storify of tweets on this topic.
Hilmar Lapp posted on G+ about the similarity of bioinformatics software persistence to the DataDryad archiving policy implemented by a collection of evolutionary biology journals. That policy change is described in a DataDryad blog post here: http://blog.datadryad.org/2011/01/14/journals-implement-data-archiving-policy/ and the policies, with links to the journal editorials, are here: http://datadryad.org/pages/jdap
The journal Computers & Geosciences has a code archiving policy and provides author instructions (PDF) for uploading code when the paper is accepted.
So this is all very nice, and many people seem to agree it’s important, but what is actually happening? What can be done? Well, Casey has led the way with action rather than just words by forking public GitHub repositories mentioned in article abstracts to BioinformaticsArchive. I really support this, but we can’t rely on Casey to manage all this indefinitely; he has (aspirations) to have a life too!
What I would like to see
My thoughts aren’t very novel, others have put forward many of these ideas:
1. A publisher driven version of the Bioinformatics Archive
I would like to see bioinformatics journals taking a lead on this: not just recommending but actually enforcing software archiving, just as they enforce submission of sequence data to GenBank. A snapshot at the time of publication is the minimum required. Even in cases where the code is not submitted (bad), an archive of the program binary is needed so the tool can actually be found and used later. Hosting on authors’ websites just isn’t good enough. There are good studies of how frequently URLs cited in the biomedical literature decay with time (PMID: 17238638), and the same is certainly true for links to software. Use of standard code repositories is what we should expect of authors, just as we expect submission of sequence data to a standard repository rather than hosting on the authors’ website.
I think there is great merit in using a GitHub public repository owned by a consortium of publishers, and maybe also academic community representatives. Discuss. An advantage of using a version-control host like GitHub is that it would apply not-too-subtle pressure to archive the code itself rather than just the binary.
2. Redundancy to ensure persistence in the worst case scenario
Archive persistence and preventing deletion is a topic that needs careful consideration. Casey discusses this extensively; authors must be prevented from deleting the archive either intentionally or accidentally. If the public repository were owned by the journals’ “Bioinformatics Software Archiving Consortium” (I just made up this consortium; unfortunately it doesn’t exist) then authors could not delete the repository. Sure, they could delete their own repository, but the fork at the community GitHub would remain. It is the permanent community fork that must be referenced in the manuscript, though a link to the authors’ perhaps more up-to-date code repository could be included in the archived publication snapshot via a wiki page or README document.
Perhaps this archive could be mirrored to BitBucket or similar for added redundancy? FigShare and DataDryad could also be used for archiving, although re-inventing the wheel for code would be suboptimal. I would like to see the FigShare and DataDryad teams enter the discussion and offer advice, since they are experts at data archiving.
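For what it’s worth, the fork-and-mirror step itself is trivially scriptable with plain git. The sketch below uses local temporary paths as stand-ins for the real repository URLs (an author’s GitHub repo and a consortium-owned archive remote); all names are hypothetical, but the `--mirror` mechanics are exactly what a mirroring bot would use.

```python
# A minimal sketch of the "mirror at publication time" step, using only
# plain git via subprocess. Local temporary paths stand in for real
# repository URLs (e.g. https://github.com/author/tool.git and a
# consortium-owned archive); every name here is hypothetical.
import subprocess
import tempfile
from pathlib import Path

def git(*args, cwd=None):
    """Run a git command, raising if it fails."""
    subprocess.run(["git", *args], cwd=cwd, check=True)

work = Path(tempfile.mkdtemp())

# Stand-in for the author's public repository.
author_repo = work / "author-tool"
git("init", "-q", str(author_repo))
(author_repo / "tool.py").write_text('print("hello")\n')
git("add", "tool.py", cwd=author_repo)
git("-c", "user.name=Author", "-c", "user.email=author@example.org",
    "commit", "-q", "-m", "version published with the manuscript",
    cwd=author_repo)

# Stand-in for the consortium-owned archive: a bare repository that the
# authors themselves cannot delete.
archive_repo = work / "archive" / "tool.git"
archive_repo.parent.mkdir(parents=True, exist_ok=True)
git("init", "-q", "--bare", str(archive_repo))

# --mirror copies every branch and tag, so the archived snapshot is a
# complete copy of the history, not just the default branch.
snapshot = work / "snapshot.git"
git("clone", "-q", "--mirror", str(author_repo), str(snapshot))
git("--git-dir=" + str(snapshot), "push", "-q", "--mirror", str(archive_repo))
```

Because the push is a full mirror, re-running it periodically would also keep the redundant copy in sync with any post-publication tags.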
3. The community to initiate actual action
A conversation with the publishers of bioinformatics software needs to be started right now. Even just PLOS, BMC, and Oxford Journals adopting a joint policy would establish a critical mass for bioinformatics software publishing. I think an open letter signed by as many people as possible might convince these publishers. Pressure on Twitter and Google+ would help too, as it always does. Who can think of a cool hashtag? And if anyone knows journal editors, an exploratory email conversation might be very productive too. Technically this is not challenging; Casey did a version himself at BioinformaticsArchive. There is very little, if any, monetary cost to implementing this. It wouldn’t take long.
But can competing journals really be organised like this? Yes, absolutely: there is clear precedent in the 2011 action of >30 ecology and evolutionary biology journals. Also, forward-looking journals will realize it is in their interests to make this happen. By implementing this they will seem more modern and professional by comparison with journals not thinking along these lines. Researchers will see a strict archiving policy as a reason to trust publications in those journals as more than just ephemeral, vague descriptions. These will become the prestige journals, because ultimately we researchers determine what the good journals are.
So what next? Well, I think gathering solid advice on good practice is important, but we also need action. I’d like to see discussions with the relevant journals start ASAP. I’m really not sure if I’m the best person to do this, and there may be better ways of doing it than just blurting it all out in a blog like this, but we do need action soon. It feels like the days before GenBank, and I think we should be ashamed of maintaining this status quo.
8 Comments
Great post Dave, there certainly seems to be a critical mass of bioinformaticians concerned about the code that gets produced.
We have to understand why software becomes unavailable: essentially, because the grant or studentship ran out. Is it sustainable for the “community” to keep it all going? I’m not so sure.
One thing that would help would be if funding agencies recognized that maintaining software is as important (and as fundable, and as REF-able) as writing new software. People mention the BBR fund from BBSRC, but it is too small and too often funds the wrong thing.
Did you see Dan Zerbino’s comment on my blog? http://biomickwatson.wordpress.com/2012/12/28/an-embargo-on-short-read-alignment-software/#comment-21 <- refactoring Velvet code (not only maintaining it but improving it) was systematically rejected from journals as being of little importance. What a truly awful decision by the editors of those journals.
Like anything in life, if we want something to happen, then we need to motivate people. We need to reward bioinformaticians (in real terms: grants, papers, and REF-able output) for maintaining the code that is already out there. Do that, and things will fall into place.
Many thanks Mick. I totally agree with you: persistence is really much more than archiving a copy, it’s also maintenance (and improvement). Support for project maintenance is really needed from funding agencies. I found the comments you got very interesting. I agree with Daniel Zerbino: “The journal/review process is simply not adapted to this type of work”. We need to modify it to get proper recognition. Maybe the social nature of GitHub (number of forks, pull requests, etc.) can be used; I think ImpactStory can track GitHub activity. Something is needed that is more than the traditional Victorian manuscript system we seem to have. But it is a self-reinforcing problem: if code is not archived publicly it can’t have its own DOI or URL, and then interactions, updates, and use can’t be properly tracked.
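As a toy illustration of how trackable this already is: the per-repository counts are plain fields in the GitHub repository API response (`GET https://api.github.com/repos/{owner}/{repo}`). The JSON below is a trimmed, made-up example rather than a live call, but the field names (`forks_count`, `stargazers_count`, `open_issues_count`) are the real ones an altmetrics service could consume.

```python
# Sketch: pulling reuse/credit indicators out of a GitHub repository
# API response. The JSON here is a made-up sample, not a live fetch;
# in practice it would come from the repos endpoint via urllib/requests.
import json

sample_response = """
{
  "full_name": "author/tool",
  "forks_count": 12,
  "stargazers_count": 47,
  "open_issues_count": 3
}
"""

def engagement_metrics(api_json):
    """Extract the counts one might feed into an altmetrics service."""
    repo = json.loads(api_json)
    return {
        "repo": repo["full_name"],
        "forks": repo["forks_count"],
        "stars": repo["stargazers_count"],
        "open_issues": repo["open_issues_count"],
    }

metrics = engagement_metrics(sample_response)
print(metrics["repo"], "has", metrics["forks"], "forks")  # author/tool has 12 forks
```

Of course, this only works once the code has a stable public home — which is the self-reinforcing problem again.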
I don’t think some investigators understand the value of what they and their team have created. They know it’s worth creating, but have no idea how to curate and grow it. I am working with two teams where disaster-recovery backups and source-control management are not part of the daily workflow of all the team members. Guiding them into these habits is a delicate matter.
It’s an area where the granting agencies could apply more pressure to share the code publicly so the community can preserve what they find useful. The resource-sharing plans recommend open licensing of software, but “open” is relative: this allows a wide range of resource sharing, from none at all to tight-fisted control and, yes, at times, complete openness.
The NIH has funded major open source software packages that do not have public code repositories, do not accept updates, and do not maintain public issue tracking. It’s unclear to me if the less open licenses have preserved investigators’ interests in maintaining software or stifled innovation on those packages.
Thanks Philip, I agree. Actually there are a lot of confusing licences for “open” out there, but I really hope that a community view of what is acceptable will prevail and make it easier for people to know what is good practice. If you spoke to the NIH, or for me the equivalent funding agencies in the UK, I bet they wouldn’t really know what specifically needs to be enforced or why, though they might know the phrasing of their regulations. It would be great to start that discussion and lead them there.
As one of the end users I can only agree with what you say. I personally think public repositories are good, especially if they include download stats and comments on the software. That would make life easier for me; it’s a jungle out there. When I did my PhD (genetics) in the 90s we used Staden for sequence assembly (I still do), a great tool that I think lost funding, so there is an issue with shortsightedness from funding agencies.
Thanks Jan. Yes, some very poor decisions were made in the funding of Staden, a real lack of vision by funding agencies. The fact that it is still a great tool today tells you how far ahead of its time it was. Download stats and comments would be a great thing; it’s not just about users, software authors should get credit via these stats for archiving the analysis tools.