A blog on data management and DataLad

A world map with DataLad minions and connected nodes
A screenshot of https://hub.datalad.org/hcp-openaccess, and the Forgejo, git-annex, and DataLad logos on top.

Hosting really large datasets with Forgejo-aneksajo

One scenario where DataLad shines is managing datasets that are larger than what a single Git repository can deal with. The combination of git-annex’s capabilities to separate Git hosting from data hosting in extremely flexible ways with DataLad’s approach to orchestrating collections of nested repositories as a joint “mono repo” is the foundation for that. One example of such a large dataset is the WU-Minn HCP1200 Data, a collection of brain imaging data, acquired from more than a thousand individual participants by the Human Connectome Project (HCP)....

2024-08-27 · 7 min · 1296 words · Michael Hanke
Screenshot of a video page of the dataset described in this post as hosted at https://hub.datalad.org/distribits/recordings, and the FFmpeg, HTCondor, git-annex, and DataLad logos on top.

Fairly big video workflow

Two years ago, my colleagues published FAIRly big: A framework for computationally reproducible processing of large-scale data. In this paper, they describe how to partition a large analysis (their example: processing anatomical images of 42 thousand subjects from UK Biobank), using DataLad to provision data and capture provenance, so that individual results can be reproduced on a laptop, even though a cluster is needed to run the entire group analysis. The article is accompanied by a workflow template and a tutorial dataset....

2024-08-16 · 20 min · 4076 words · Michał Szczepanik

Collecting runtime statistics and outputs with `con-duct` and `datalad-run`

One of the challenges that I’ve experienced when attempting to replicate the execution of data analysis is quite simply that information regarding the required resources is sparse. For example, when submitting a SLURM job, how does one know the wallclock time to request, much less memory and CPU resources? To solve this problem, we at the Center for Open Neuroscience have created a new tool, con-duct (aka duct), to easily collect this information....

2024-08-09 · 3 min · 547 words · Austin Macdonald
A screenshot of Forgejo action runner status and result page, with the Forgejo, podman, and systemd logos on top.

Operate a runner for Forgejo actions with podman and systemd

This article is part three of a series on self-hosting Forgejo-aneksajo. If you have not read part one and part two already, check them out. In many ways, this article is a direct continuation. If you are self-hosting a Forgejo instance already, it can make a lot of sense to also operate a runner for its actions. Forgejo’s actions will feel very familiar to anyone who has used GitHub’s actions. That being said, the Forgejo documentation clearly states that “they are not and will never be identical”....

2024-08-06 · 10 min · 2055 words · Michael Hanke
A screenshot of `systemctl status` for a container process, and the Forgejo, podman, and systemd logos on top.

Deploying and managing Forgejo-aneksajo with podman and systemd

In a previous article, I described my first steps with Forgejo-aneksajo, and deploying it on my Raspberry Pi. This got me so excited that I started looking into deploying it on more machines. However, I quickly realized that the Docker-based approach I had used originally was not going to yield a setup that I would see myself wanting to maintain for an extended period of time. I knew I wanted something that I could completely apt install from Debian, that is maintained there as a whole, and not pieced together from only loosely connected components....

2024-07-31 · 16 min · 3272 words · Michael Hanke