Git Worktrees + DataLad: The Missing Link Between Daily Development and Reproducible Analysis

Prelude: Scientist in a data labyrinth

As an experimental neuroscientist in training, I often find myself caught between two worlds: the messy, exploratory world of data analysis, where I try to make sense of experimental data, often relying on trial and error, and the structured, reproducible world of scientific publication I aspire to, where hopefully every figure will be exactly reproducible. Between these worlds lies a labyrinth of processing pipelines, half-written scripts, and the ever-present risk of breaking a working analysis while trying to improve it. Wandering the labyrinth is rarely straightforward: one is expected to hit many dead ends and discover other interesting distractions before finding the actual treasure, the key results that hopefully lead to a scientific publication or another form of consolidated knowledge. I try to illustrate this metaphoric labyrinth, which is my non-metaphoric reality, in a diagram: ...
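
To make the pairing in the title concrete before diving into the post, here is a minimal sketch of the kind of setup it builds on: a second working directory for experiments, created with `git worktree`, so the stable analysis stays untouched. Directory and branch names are made up for illustration:

```sh
cd myanalysis                        # an existing DataLad dataset / Git repo
# Create a second working directory on a new throwaway branch;
# 'main' keeps the last known-good state of the analysis.
git worktree add ../myanalysis-experiment -b experiment
cd ../myanalysis-experiment          # explore here without fear of breakage
```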

2025-12-08 · 22 min · 4505 words · Jiameng Wu
[Image: A screenshot of the DataLad-Registry web UI]

DataLad-Registry: Bringing Benefits of Centrality to DataLad

DataLad provides a platform for managing and uniformly accessing data resources. It also captures basic provenance information about data results within Git repository commits. However, discovering DataLad datasets, or Git repositories that DataLad has operated on, can be challenging. They can be shared anywhere online: on popular Git hosting platforms such as GitHub, on generic file-hosting platforms such as OSF, on neuroscience platforms such as GIN, or they can even be available only within the internal network of an organization, or on just one particular server. We built DataLad-Registry to address some of the problems of locating datasets and finding useful information about them. (For convenience, for the rest of this blog we will use the term “dataset” to refer to a DataLad dataset or a Git repo that has been “touched” by the datalad run command, i.e. one that has “DATALAD RUNCMD” in a commit message.) ...
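
As a small illustration of that definition (the script name is hypothetical), this is how `datalad run` leaves a “DATALAD RUNCMD” record behind, and how such repositories can be spotted afterwards:

```sh
# datalad run executes the command, saves its outputs, and embeds a
# machine-readable "DATALAD RUNCMD" record in the commit message.
datalad run -m "extract summary stats" "python extract_stats.py"

# Repositories "touched" by datalad run can then be identified by
# searching commit messages:
git log --grep='DATALAD RUNCMD' --oneline
```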

2024-12-06 · 9 min · 1706 words · Isaac To, Austin Macdonald, Yaroslav O Halchenko
[Image: A screenshot of https://hub.datalad.org/hcp-openaccess, with the Forgejo, git-annex, and DataLad logos on top]

Hosting really large datasets with Forgejo-aneksajo

One scenario where DataLad shines is managing datasets that are larger than what a single Git repository can deal with. The foundation for that is the combination of git-annex’s ability to separate Git hosting from data hosting in extremely flexible ways with DataLad’s approach of orchestrating collections of nested repositories as a joint “mono repo”. One example of such a large dataset is the WU-Minn HCP1200 Data, a collection of brain imaging data acquired from more than a thousand individual participants by the Human Connectome Project (HCP). The DataLad Handbook has an article on how the HCP1200 DataLad dataset was created a few years ago. However, the focus of this blog post is not how it was created, but rather how and where it is hosted. ...
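
For a sense of what consuming such a hosted mono repo looks like, here is a minimal sketch; the repository and subdataset paths are illustrative placeholders, as the actual layout is described in the post:

```sh
# Clone only the lightweight superdataset; annexed data stay on the server.
datalad clone https://hub.datalad.org/hcp-openaccess/HCP1200 hcp1200
cd hcp1200
# Install one nested subdataset without fetching its data (-n) ...
datalad get -n <subject-subdataset>
# ... then retrieve individual files on demand.
datalad get <subject-subdataset>/<file>
```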

2024-08-27 · 7 min · 1296 words · Michael Hanke
[Image: A screenshot of a video page of the dataset described in this post, as hosted at https://hub.datalad.org/distribits/recordings, with the FFmpeg, HTCondor, git-annex, and DataLad logos on top]

Fairly big video workflow

Two years ago, my colleagues published FAIRly big: A framework for computationally reproducible processing of large-scale data. In this paper, they describe how to partition a large analysis (their example: processing anatomical images of 42,000 subjects from UK Biobank), using DataLad to provision data and capture provenance, so that individual results can be reproduced on a laptop even though a cluster is needed to run the entire group analysis. The article is accompanied by a workflow template and a tutorial dataset. ...
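
The laptop-scale reproduction the paper describes rests on DataLad re-executing recorded provenance. A minimal sketch, with an illustrative dataset URL and commit placeholder:

```sh
# Clone the results dataset and re-execute the provenance record of a
# single job commit; only the inputs for that one result are fetched.
datalad clone <results-dataset-url> results
cd results
datalad rerun <commit-with-DATALAD-RUNCMD-record>
```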

2024-08-16 · 20 min · 4076 words · Michał Szczepanik

Collecting runtime statistics and outputs with `con-duct` and `datalad-run`

One of the challenges that I’ve experienced when attempting to replicate the execution of a data analysis is, quite simply, that information about the required resources is sparse. For example, when submitting a SLURM job, how does one know the wallclock time to request, much less the memory and CPU resources? To solve this problem, we at the Center for Open Neuroscience have created a new tool, `con-duct` aka `duct`, to easily collect this information. When combined with `datalad-run`, `duct` collects crucial runtime information for future replication and reuse. ...
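
A minimal sketch of that combination (the wrapped command and commit message are made up; consult the con-duct docs for its options): wrapping the analysis command with `duct` inside `datalad run`, so the collected resource statistics land in the dataset together with the outputs:

```sh
# duct monitors the wrapped process (wallclock time, memory, CPU) and
# writes its usage logs; datalad run records the command and saves the
# outputs, including those logs, into the dataset's history.
datalad run -m "analysis under resource monitoring" \
    "duct python compute_stats.py"
```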

2024-08-09 · 3 min · 547 words · Austin Macdonald