NOTE: This extension is currently work-in-progress!
This extension equips DataLad with the functionality to (re)compute file
content on demand, based on a specified set of instructions. In particular,
it features a datalad make command for capturing instructions on how to
compute a given file, allowing the file content to be safely removed. It also
implements a git-annex special remote, which enables the (re)computation of
the file content based on the captured instructions. This is particularly
useful when the file content can be produced deterministically. If storing
the file content is more expensive than (re)producing it, this functionality
can lead to more effective resource utilization. Thus, this extension may be
of interest to a wide, interdisciplinary audience, including researchers,
data curators, and infrastructure administrators.
This extension requires Python >= 3.9. It also requires GPG to be installed
as well as a GPG key-pair to sign and verify commits. In addition,
git has to be configured to sign commits. For more information on how to sign
commits, refer to the Git documentation.
There is no PyPI package yet. To install the extension, clone the repository
and install it via pip (preferably in a virtual environment):
> git clone https://github.com/datalad/datalad-remake.git
> cd datalad-remake
> pip install -r requirements-devel.txt
> pip install .
To check your installation, run:
> datalad make --help
Ensure that your commits are signed (see the Git documentation), and create a dataset:
> datalad create remake-test-1
> cd remake-test-1
Create a template and place it in the .datalad/make/methods directory:
> mkdir -p .datalad/make/methods
> cat > .datalad/make/methods/one-to-many <<EOF
parameters = ['first', 'second', 'output']
command = [
"bash",
"-c",
"echo content: {first} > '{output}-1.txt'; echo content: {second} > '{output}-2.txt'",
]
EOF
> datalad save -m "add 'one-to-many' remake method"
Before the computation can be executed, datalad make has to be told to trust
the public key of the signer. How this is done is described in the section
Trusted Keys.
Execute a computation and save the result:
> datalad make -p first=bob -p second=alice -p output=name -o name-1.txt \
-o name-2.txt one-to-many
The method one-to-many will create two files with the names <output>-1.txt
and <output>-2.txt. Thus, the two files name-1.txt and name-2.txt need to
be specified as outputs in the command above.
> cat name-1.txt
content: bob
> cat name-2.txt
content: alice
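The relationship between the -p parameters and the resulting output names can be illustrated with a small Python sketch. It assumes that the placeholders in the template are filled in the manner of Python's str.format. That is an assumption made for illustration, not a description of how datalad-remake performs the substitution, and render_command is a hypothetical helper.

```python
# Illustrative sketch only; not datalad-remake's actual implementation.
# The two definitions below mirror the 'one-to-many' template from above.
parameters = ['first', 'second', 'output']
command = [
    "bash",
    "-c",
    "echo content: {first} > '{output}-1.txt'; echo content: {second} > '{output}-2.txt'",
]


def render_command(command, parameters, values):
    """Fill the placeholders of every command element (hypothetical helper)."""
    missing = [name for name in parameters if name not in values]
    if missing:
        raise ValueError(f"missing parameter values: {missing}")
    # Assumption: placeholders behave like str.format fields.
    return [part.format(**values) for part in command]


rendered = render_command(
    command,
    parameters,
    {"first": "bob", "second": "alice", "output": "name"},
)
print(rendered[2])
# echo content: bob > 'name-1.txt'; echo content: alice > 'name-2.txt'
```

With output=name, the rendered command writes name-1.txt and name-2.txt, which is why exactly these two files are passed via -o.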
DataLad REMAKE can recompute dropped content. To demonstrate this, we will
drop a file and then recreate it via datalad get.
Drop the content of name-1.txt, verify it is gone, and recreate it via
datalad get, which "fetches" it from the datalad-remake remote. Note: the
datalad-remake remote was automatically created by the command datalad make.
> datalad drop name-1.txt
> cat name-1.txt # this will fail, because the content has been dropped
> datalad get name-1.txt
> cat name-1.txt
The datalad make command can also be used to perform a prospective
computation, that is, to record the instructions without executing them yet.
A prospective computation can be initiated by using the
--prospective-execution option:
> datalad make -p first=john -p second=susan -p output=person \
-o person-1.txt -o person-2.txt --prospective-execution one-to-many
The following command will fail, because no computation has been performed, and the file content is not yet available:
> cat person-1.txt # this will fail, because the computation has not yet been performed
We can further inspect person-1.txt with git annex info:
> git annex info person-1.txt
Similarly, git annex whereis will show the URL that can be handled by the
git-annex special remote:
> git annex whereis person-1.txt
Finally, datalad get can be used to produce the file content (for the first
time!) based on the specified instructions:
> datalad get person-1.txt
> cat person-1.txt
content: john
Please note that, for this feature to work, the configuration variable
remote.datalad-remake-auto.annex-security-allow-unverified-downloads is set
to ACKTHPPT for each automatically created git-annex special remote.
Why does the configuration variable have to be set?
This setting allows git-annex to download files from the special remote datalad-remake
although git-annex cannot check a hash to verify that the content is correct.
Because the computation was never performed, there is no hash available for content
verification of an output file yet.
For more information see the description of
remote.<name>.annex-security-allow-unverified-downloads and of
annex.security.allow-unverified-downloads at
https://git-annex.branchable.com/git-annex/.
Additional examples can be found in the examples directory.
By default, datalad-remake will only perform "trusted" computations. That
holds for the direct execution via datalad make as well as for the indirect
execution via the git-annex special remote as a result of datalad get. A
computation is trusted if the method and the parameters that define the
computation are trusted.
A method is considered "trusted" if the last commit to the method template is signed by a trusted key.
Parameters, i.e. input, output, and method-parameter values, are initially
provided in the datalad make command line. If the datalad make command
executes successfully, they will be associated with the output files of the
datalad make command. These associations are done via a commit to the dataset
and a call to git annex addurl. Parameters are considered "trusted" if:
- they are provided by the user via the datalad make command line, or
- they were associated with a file in a commit that is signed by a trusted key.
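The trust rules above can be summarized in a short Python sketch. All names in it (signed_by_trusted_key, is_computation_trusted, and their arguments) are hypothetical and chosen for illustration. The sketch only restates the rules as boolean logic and does not reflect how datalad-remake implements them.

```python
from typing import Optional, Set


def signed_by_trusted_key(signer_key_id: Optional[str], trusted_keys: Set[str]) -> bool:
    """True if a commit carries a valid signature made with a trusted key.

    signer_key_id is None for an unsigned (or unverifiable) commit.
    """
    return signer_key_id is not None and signer_key_id in trusted_keys


def is_computation_trusted(
    method_commit_signer: Optional[str],
    parameters_from_command_line: bool,
    parameter_commit_signer: Optional[str],
    trusted_keys: Set[str],
) -> bool:
    """A computation is trusted if both the method and the parameters are trusted."""
    # Method: the last commit to the method template must be signed by a trusted key.
    method_ok = signed_by_trusted_key(method_commit_signer, trusted_keys)
    # Parameters: given directly on the command line, or recorded in a commit
    # that is signed by a trusted key.
    parameters_ok = parameters_from_command_line or signed_by_trusted_key(
        parameter_commit_signer, trusted_keys
    )
    return method_ok and parameters_ok


trusted = {"F1B64364FF34DDCB"}
# Direct execution: parameters come from the command line.
print(is_computation_trusted("F1B64364FF34DDCB", True, None, trusted))   # True
# Indirect execution with parameters recorded in an unsigned commit.
print(is_computation_trusted("F1B64364FF34DDCB", False, None, trusted))  # False
```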
Signature validation is performed by git verify-commit, which uses GPG to
perform the cryptographic operations. To successfully verify a signature, the
signer's public key must be added to the active GPG-keyring. To indicate to
datalad make that the signer should be trusted, the key-id of the signer's
public key must be added to the git configuration variable
datalad.make.trusted-keys. To ensure that you have control over trusted keys,
datalad-remake will not read this variable from the repository configuration,
but only from git global variables, from git system variables, or from the
command itself (via the option -c).
A trusted key could, for example, be added by executing the following command:
> git config --global --add datalad.make.trusted-keys <key-id>
If more than one key should be defined as trusted, the configuration variable
datalad.make.trusted-keys
can be set to a comma-separated list of key-ids,
e.g.:
> git config --global --add datalad.make.trusted-keys <key-id-1>,<key-id-2>,...,<key-id-n>
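Because git config --add appends a new value instead of replacing existing ones, the variable may end up holding several values, each of which may itself be a comma-separated list. The following Python sketch shows one way such values could be flattened into a list of key-ids. parse_trusted_keys is a hypothetical helper, the second and third key-ids are made-up placeholders, and how datalad-remake actually combines multiple values is not specified here.

```python
def parse_trusted_keys(config_values):
    """Flatten configuration values into a de-duplicated list of key-ids.

    config_values: list of strings, e.g. the lines printed by
    `git config --global --get-all datalad.make.trusted-keys`.
    """
    keys = []
    for value in config_values:
        for key_id in value.split(","):
            key_id = key_id.strip()
            # Skip empty fragments (e.g. trailing commas) and duplicates.
            if key_id and key_id not in keys:
                keys.append(key_id)
    return keys


# Placeholder values for illustration only.
values = ["F1B64364FF34DDCB", "AAAA1111BBBB2222, CCCC3333DDDD4444,"]
print(parse_trusted_keys(values))
# ['F1B64364FF34DDCB', 'AAAA1111BBBB2222', 'CCCC3333DDDD4444']
```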
The key-id can be obtained via gpg --list-keys --keyid-format long. The key
id is the part after the / in the pub line (or in the sec line if secret keys
are listed, as in the example below). For example, in the following
output:
> gpg --list-keys --keyid-format long
/tmp/test_simple_verification0/gpg/pubring.kbx
--------------------------------------------------------------------------
sec rsa4096/F1B64364FF34DDCB 2024-10-28 [SCEAR]
F6AC1EE006B3E2D0805DA103F1B64364FF34DDCB
uid [ultimate] Test User <test@example.com>
the key id is F1B64364FF34DDCB. To inform datalad make and the git-annex
special remote that this key is trusted, the following command could be used:
> git config --global --add datalad.make.trusted-keys F1B64364FF34DDCB
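If the key-id has to be extracted in a script, the rule "the part after the / in the key line" can be expressed as a regular expression. The sketch below applies it to the listing shown above. extract_key_ids is a hypothetical helper and the pattern only targets this human-readable layout. For robust scripting, gpg's machine-readable --with-colons output is usually the better choice.

```python
import re

# Matches lines such as "sec rsa4096/F1B64364FF34DDCB 2024-10-28 [SCEAR]"
# and captures the hexadecimal key-id after the slash.
KEY_LINE = re.compile(r"^(?:pub|sec)\s+\S+/([0-9A-Fa-f]+)\b", re.MULTILINE)


def extract_key_ids(listing):
    """Return all long key-ids found in pub/sec lines of a gpg key listing."""
    return KEY_LINE.findall(listing)


sample = """\
sec   rsa4096/F1B64364FF34DDCB 2024-10-28 [SCEAR]
      F6AC1EE006B3E2D0805DA103F1B64364FF34DDCB
uid                 [ultimate] Test User <test@example.com>
"""

print(extract_key_ids(sample))
# ['F1B64364FF34DDCB']
```

In this listing the long key-id equals the last 16 hexadecimal digits of the fingerprint printed on the second line.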
For instructions on how to sign commits, see the Git documentation.
See CONTRIBUTING.md if you are interested in internals or contributing to the project.
This development was supported by the European Union’s Horizon research and innovation programme under grant agreement eBRAIN-Health (HORIZON-INFRA-2021-TECH-01-01, grant no. 101058516).