I'm not aware of a built-in solution in snakemake, but here's how I would go about it.
Say your input data is data.txt. This is the file (or files) that gets overwritten, possibly without actually changing. Instead of using this file directly in the snakemake rules, you use a cached copy that is refreshed only if the md5 differs between the original and the cache. The check can be done before rule all using standard python code.
Here's a pseudo-code example (a runnable sketch follows below):
input_md5 = get_md5('data.txt')
cache_md5 = get_md5('cache/data.txt')
if input_md5 != cache_md5:
    # This will trigger the pipeline because cache/data.txt is newer than output
    copy('data.txt', 'cache/data.txt')

rule all:
    input:
        'stuff.txt'

rule one:
    input:
        'cache/data.txt',
    output:
        'stuff.txt',
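A minimal runnable version of that check (the part that goes before rule all) could look like the following. It uses python's standard hashlib, os and shutil modules; the chunked reading and the first-run handling are my own additions, not something snakemake requires:

import hashlib
import os
import shutil

def get_md5(path):
    # Read in chunks so large data files don't have to fit in memory
    md5 = hashlib.md5()
    with open(path, 'rb') as fin:
        for chunk in iter(lambda: fin.read(8192), b''):
            md5.update(chunk)
    return md5.hexdigest()

os.makedirs('cache', exist_ok=True)

input_md5 = get_md5('data.txt')
# On the first run the cached copy does not exist yet
cache_md5 = get_md5('cache/data.txt') if os.path.exists('cache/data.txt') else None

if input_md5 != cache_md5:
    # shutil.copy (not copy2) gives the cache a fresh timestamp,
    # which makes it newer than the outputs and triggers the pipeline
    shutil.copy('data.txt', 'cache/data.txt')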
EDIT: This pseudo-code saves the md5 of the cached input files so they don't need to be recomputed every time. It also saves the timestamp of the input data to a file, so that the md5 of the input is recomputed only if that timestamp is newer than the cached one (again, a runnable sketch follows after the block):
for each input datafile do:
    current_input_timestamp = get_timestamp('data.txt')
    cache_input_timestamp = read('data.timestamp.txt')
    if current_input_timestamp > cache_input_timestamp:
        input_md5 = get_md5('data.txt')
        cache_md5 = read('cache/data.md5')
        if input_md5 != cache_md5:
            copy('data.txt', 'cache/data.txt')
            write(input_md5, 'cache/data.md5')
        write(current_input_timestamp, 'data.timestamp.txt')

# If any input datafile is newer than the cache, the pipeline will be triggered
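A runnable sketch of this EDIT, re-using get_md5 from the sketch above; DATAFILES and the exact *.md5 / *.timestamp.txt file names are just illustrative choices:

import os
import shutil

DATAFILES = ['data.txt']  # hypothetical list of input data files

os.makedirs('cache', exist_ok=True)

for datafile in DATAFILES:
    cached = os.path.join('cache', datafile)
    md5_file = cached + '.md5'                # e.g. cache/data.txt.md5
    stamp_file = datafile + '.timestamp.txt'  # e.g. data.txt.timestamp.txt
    current_ts = os.path.getmtime(datafile)
    cached_ts = float(open(stamp_file).read()) if os.path.exists(stamp_file) else 0.0
    if current_ts > cached_ts:
        # Timestamp changed: worth recomputing the md5 (get_md5 as defined above)
        input_md5 = get_md5(datafile)
        cache_md5 = open(md5_file).read() if os.path.exists(md5_file) else None
        if input_md5 != cache_md5:
            # Content really changed: refresh the cache and its stored md5
            shutil.copy(datafile, cached)
            with open(md5_file, 'w') as fout:
                fout.write(input_md5)
        # Record the new timestamp either way, so the md5 is not recomputed next time
        with open(stamp_file, 'w') as fout:
            fout.write(str(current_ts))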
However, this adds complexity to the pipeline, so I would check whether it is worth it.