A blog on data management and DataLad

A world map with DataLad minions and connected nodes

Screenshot of Nextcloud web interface showing some folders, two of them shared, and a sharing pane where cloneable-dataset has a share link set to view only. Nextcloud, WebDAV, and git-annex logos are overlaid on top of the screenshot.

Putting new git-annex features to use with Nextcloud

Git-annex continues to evolve. In this post, I want to look at two changes, one big and one small, introduced within the last year. Together, they make publishing files through Nextcloud much nicer. Specifically, it is now possible for a read-only shared Nextcloud folder to be a one-stop shop for cloning the dataset and getting file contents. This can be a useful setup for sharing (research) data: having the shared folder be a single point of access is convenient, and restricting write access is necessary to prevent unauthorized changes....

Collaborative infrastructure for a lab: Forgejo

For the past 18 years I have been a GitHub user. It has been an extremely convenient platform for collaborating with many people from all over the world. What makes GitHub, and other platforms like it, particularly attractive is that they are typically way more accessible than any institutionally provided infrastructure (even if not without issues of its own). GitHub also provided an extremely reliable and stable infrastructure that encouraged and rewarded building on it....

DataLad-Registry: Bringing Benefits of Centrality to DataLad

DataLad provides a platform for managing and uniformly accessing data resources. It also captures basic provenance information about data results within Git repository commits. However, discovering DataLad datasets or Git repositories that DataLad has operated on can be challenging. They can be shared anywhere online, including popular Git hosting platforms such as GitHub, generic file hosting platforms such as OSF, and neuroscience platforms such as GIN, or they can even be available only within the internal network of an organization or on just one particular server....

A screenshot of https://hub.datalad.org/hcp-openaccess, and the Forgejo, git-annex, and DataLad logos on top.

Hosting really large datasets with Forgejo-aneksajo

One scenario where DataLad shines is managing datasets that are larger than what a single Git repository can deal with. The foundation for that is the combination of git-annex’s ability to separate Git hosting from data hosting in extremely flexible ways and DataLad’s approach to orchestrating collections of nested repositories as a joint “mono repo”. One example of such a large dataset is the WU-Minn HCP1200 Data, a collection of brain imaging data acquired from more than a thousand individual participants by the Human Connectome Project (HCP)....

Screenshot of a video page of the dataset described in this post as hosted at https://hub.datalad.org/distribits/recordings, and the FFmpeg, HTCondor, git-annex, and DataLad logos on top.

Fairly big video workflow

Two years ago, my colleagues published FAIRly big: A framework for computationally reproducible processing of large-scale data. In this paper, they describe how to partition a large analysis (their example: processing anatomical images of 42 thousand subjects from UK Biobank), using DataLad to provision data and capture provenance, so that individual results can be reproduced on a laptop, even though a cluster is needed to run the entire group analysis. The article is accompanied by a workflow template and a tutorial dataset....