Sourced from http://biorxiv.org/content/early/2017/03/29/122077.
Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro
Review posted on 08th May 2017
This paper introduces the squeakr system for exact and inexact k-mer counting.
The paper is well written and I have been able to obtain and execute
See full review here.
Krzysztof J. Gorgolewski, Fidel Alfaro-Almagro, Tibor Auer, Pierre Bellec, Mihai Capota, Mallar Chakravarty, Nathan W. Churchill, R. Cameron Craddock, Gabriel Devenyi, Anders Eklund, Oscar Esteban, Guillaume Flandin, J. Swaroop Guntupalli, Mark Jenkinson, Anisha Keshavan, Gregory Kiar, Pradeep Reddy Raamana, David Raffelt, Christopher J. Steele, Pierre-Olivier Quirion, Robert E. Smith, Stephen Strother, Gael Varoquaux, Tal Yarkoni, Yida Wang, Russell Poldrack
Review posted on 25th January 2017
This is a well written and seemingly comprehensive paper about the idea of using containerization technology (Docker & Singularity) and a slightly custom framework to distribute/provide apps for neuroimaging.
The writeup is well done, and with two exceptions, we have no major comments.
The major exception is that the Singularity discussion needs to be revisited. Right now it is somewhat too aspirational and does not clearly articulate the (major) drawbacks of Singularity; we had to read between the lines and do some digging on our own to figure out where the failure points were.
A few critical aspects of Singularity are either not mentioned or glossed over:
* it seems you still need root access on HPCs to *install* Singularity containers.
* the Docker-to-Singularity conversion approach looks very inconvenient, and we feel it is a major drawback.
* the imposition of "read-only" mode on container execution is understandable but again should be highlighted as inconvenient.
* Singularity *may* be installable *in theory* on many HPCs, but we don't have any idea of its adoption in practice. This could be addressed by an explicit comment that it's still early days but that at least the situation is likely to be better than it is with Docker.
We think these points need to be made more clearly in the paper.
Our second major concern is that depending on the Docker Hub for archiving and versioning is not a good idea; at the very least, some sort of caveat should be added and some longer-term directions suggested.
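One way to make the versioning concern concrete, offered only as an illustrative sketch and not as something the paper or this review prescribes: a tag on a registry is a name that can later be pointed at a different image, whereas a reference of the form `name@sha256:<digest>` identifies the image content itself. The helper names and the `example/app` image below are hypothetical.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image_ref: str) -> bool:
    """Return True if the reference names image content by digest, not by tag."""
    return DIGEST_SUFFIX.search(image_ref.strip()) is not None


def unpinned(image_refs):
    """Return the references that rely on a tag (or on the implicit default tag)."""
    return [ref for ref in image_refs if not is_digest_pinned(ref)]


# Hypothetical image names, for illustration only.
refs = [
    "example/app:v1.0.0",
    "example/app",
    "example/app@sha256:" + "0" * 64,
]
print(unpinned(refs))
```

A check like this only flags references that could change meaning over time; it does not address long-term archiving, which still needs a home outside the registry.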
Minor correction -- 'particiapant_label' is misspelled.
C. Titus Brown
Stephen R Piccolo, Adam B Lee, Michael B Frampton
Review posted on 17th July 2015
In this paper, Piccolo et al. do a nice (and I think comprehensive?) job of outlining six strategies for computational reproducibility. The point is well made that science is increasingly dependent on computational reproducibility (and that in theory we should be able to do computational reproducibility easily and well) and hence we should explore effective approaches that are actually being used.
I know of no other paper that covers this array of material, and this is a quite nice exposition that I would recommend to many. I can't evaluate how broadly it will appeal to a diverse audience but it seems very readable to me.
The following comments are offered as helpful suggestions, not criticisms -- make of them what you will.
The paper almost completely ignores HPC. I'm good with that, but it's a bit surprising (since many computational scientists seem to think that reproducible orchestration of many processors is an unachievable task). Noted in passing.
I was somewhat surprised by the lack of emphasis on version control systems. These are really critical in programming for ensuring reproducibility. I also found a missing citation! You should look at journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745 (yes, sorry, I'm on the paper).
Speaking of which, I appreciate the completeness of references (and even the citation of my blog post ;) but it would be interesting to see if Millman and Perez have anything to offer: http://www.jarrodmillman.com/oss-chapter.html. Certainly a good citation (I think you hit the book, but this is a particularly good chapter.)
I would suggest (in the section that mentions version control systems, ~line 170 of p9) recommending that authors "tag" specific versions for the publication, even if they later recommend using updated versions. (Too many people say "use this repo!" without specifying a revision.)
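To illustrate the difference between "use this repo" and a pinned revision, here is a minimal sketch under stated assumptions. The function names, the list of moving reference names, and the example URL are all hypothetical, and the check is a heuristic on the string alone: it cannot tell a tag from a branch without querying the repository.

```python
import re

# Names that typically move over time and so do not identify a fixed revision.
# This list is an assumption for illustration, not an exhaustive rule.
MOVING_REFS = {"", "head", "master", "main", "develop", "latest"}

# A full SHA-1 commit identifier is 40 hexadecimal characters.
FULL_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def describe_revision(revision: str) -> str:
    """Classify a revision string as 'moving', 'commit', or 'tag-like'."""
    rev = revision.strip().lower()
    if rev in MOVING_REFS:
        return "moving"
    if FULL_SHA1.match(rev):
        return "commit"
    return "tag-like"


def cite_repository(repo_url: str, revision: str) -> str:
    """Build a citation string, refusing revisions that are known to move."""
    if describe_revision(revision) == "moving":
        raise ValueError(f"'{revision}' does not identify a fixed revision")
    return f"{repo_url} (revision {revision.strip()})"


# Hypothetical repository and tag name.
print(cite_repository("https://example.org/lab/tool.git", "v1.0-paper"))
```

Authors can still recommend a newer version elsewhere in the text; the point is that the revision used for the publication is named explicitly.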
The section on literate programming could usefully mention that these literate programming environments do not offer good mechanisms for long-running programs, so they may not be appropriate for things that take more than a few minutes to run.
Also, and perhaps most important, these literate programming environments provide a REPL and can thus track exploratory data analysis and "harden" it when it works and the author moves on to another data analysis - so even if the authors don't want to clean up their notebook before publication, you can track exactly how they got their final results. I think this is important for practical reproducibility. I don't know quite what to suggest in the context of the paper but it seems like an important point to me.
Both the virtual machine and container sections should mention the challenges of raw data bundling, which is one of the major drawbacks here - not only is the VM large, but (unless you are partnering with e.g. Amazon to "scale out") you must distribute potentially large data sets. I think this is one of the biggest practical issues facing data intensive sciences. (There was a nice commentary recently by folk in human genomics begging the NIH to make human genomic data available via the cloud; I can track it down if the authors haven't seen it.)
I think it's important to emphasize how transparent most Dockerfiles are (and how this is a different culture than the VM deployment scene, where configuration systems are often not particularly emphasized except in the devops community). I view this as one of the most important cultural differences driving container adoption, and for once it's good for science!
The Docker ecosystem also seems quite robust, which I think is important.
[ ... specific typos etc omitted ... ]