One of the challenges that I’ve experienced when attempting to replicate the execution of data analysis is quite simply that information regarding the required resources is sparse. For example, when submitting a SLURM job, how does one know the wallclock time to request, much less memory and CPU resources?
To solve this problem we at the Center for Open
Neuroscience have created a new tool, con-duct
aka
duct
to easily collect this information.
When combined with datalad-run
, duct
collects crucial runtime information for future replication
and reuse.
Demo
To show off duct
, lets use the DataLad-101 and LongNow podcast examples from the DataLad handbook
Clone the repository, and populate the longnow podcasts.
git clone git@github.com:datalad-handbook/DataLad-101.git
cd DataLad-101
git checkout -b duct-demo
datalad clone --dataset . https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow
Now let’s install con-duct
:
pip install con-duct
We now have our podcasts, and just like the datalad run section of the handbook we can generate a list of titles, except this time we will prepend the command with a duct
wrapper to collect runtime statistics.
Note: we must use --quiet
since we are using >
to capture the output of the script, we don’t want to pollute it with duct
output.
$ datalad run -m "Use datalad run with duct to create a list of podcast titles, and capture runtime information" \
"duct --sample-interval 0.01 --report-interval 0.1 --quiet bash code/list_titles.sh > recordings/recordings.tsv"
run(ok): /home/austin/devel/DataLad-101 (dataset) [duct --sample-interval 0.01 --report-int...]
add(ok): .duct/logs/2024.07.30T09.02.01-110957_info.json (file)
add(ok): .duct/logs/2024.07.30T09.02.01-110957_stderr (file)
add(ok): .duct/logs/2024.07.30T09.02.01-110957_stdout (file)
add(ok): .duct/logs/2024.07.30T09.02.01-110957_usage.json (file)
add(ok): recordings/recordings.tsv (file)
save(ok): . (dataset)
In addition to the recordings.tsv
we also have a set of files describing the execution.
$ ls .duct/logs
2024.07.30T09.02.01-110957_info.json
2024.07.30T09.02.01-110957_stderr
2024.07.30T09.02.01-110957_stdout
2024.07.30T09.02.01-110957_usage.json
We’ve captured the output of the command (which in this example was captured anyway with >
):
$ cat .duct/logs/*_stdout
2017-06-09 How Digital Memory Is Shaping Our Future Abby Smith Rumsey
2017-06-09 Pace Layers Thinking Stewart Brand Paul Saffo
2017-06-09 Proof The Science of Booze Adam Rogers
... <snip>
If there were errors, we’ve captured them as well, but in this case we just have an empty file.
cat .duct/logs/*_stderr # should be empty
We also have a summary of the environment and statistics about the entire run. This information may be particularly helpful during replication– the replicator will now have crucial information to estimate the resources they will need!
$ cat .duct/logs/*_info.json | jq
{
"command": "bash code/list_titles.sh",
"system": {
"uid": "austin",
"memory_total": 33336778752,
"cpu_total": 20
},
"env": {},
"gpu": null,
"duct_version": "0.1.0",
"execution_summary": {
"exit_code": 0,
"command": "bash code/list_titles.sh",
"logs_prefix": ".duct/logs/2024.07.30T09.02.01-110957_",
"wall_clock_time": "0.611 sec",
"peak_rss": "3680 KiB",
"average_rss": "3543.704 KiB",
"peak_vsz": "223344 KiB",
"average_vsz": "215072.000 KiB",
"peak_pmem": "0.0%",
"average_pmem": "0.000%",
"peak_pcpu": "0.0%",
"average_pcpu": "0.000%",
"num_samples": 27,
"num_reports": 7
}
}
For more fine-grained information (especially useful to generate graphs), we also have a collection of reports to indicate usage over time.
Each report can be an aggregation of 1 or more samples, it shows the resource utilization of each process, and the total of all related processes.
$ cat .duct/logs/*_usage.json | jq
{
"timestamp": "2024-07-30T09:02:01.524299-05:00",
"num_samples": 1,
"processes": {
"110960": {
"pcpu": 0.0,
"pmem": 0.0,
"rss": 3680,
"vsz": 223344,
"timestamp": "2024-07-30T09:02:01.524299-05:00"
}
},
"totals": {
"pmem": 0.0,
"pcpu": 0.0,
"rss_kb": 3680,
"vsz_kb": 223344
},
"averages": {
"rss": 3680,
"vsz": 223344,
"pmem": 0.0,
"pcpu": 0.0,
"num_samples": 1
}
}
... <snip>