I'm using make to control the data flow in a statistical analysis. I have my raw data in a directory ./data/raw_data_files, and I've got a data manipulation script that creates a cleaned data cache at ./cache/clean_data. The make rule is something like:
cache/clean_data:
	scripts/clean_data
I do not want to touch the data in ./data/, either with make or with any of my data munging scripts. Is there any way in make to create a dependency for cache/clean_data that just checks whether specific files in ./data/ are newer than the last time make ran?
If clean_data is a single file, just let it depend on all data files:
cache/clean_data: data/*
	scripts/clean_data
If it is a directory containing multiple cleaned files, the easiest way is to write a stamp file and have that depend on your data files:
cache/clean_data-stamp: data/*
	scripts/clean_data
	touch cache/clean_data-stamp
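Since the question asks about specific files, here is a minimal sketch of the same stamp-file idea with an explicit prerequisite list instead of a glob. The file names data/a.csv and data/b.csv and the variable RAW are hypothetical placeholders, and the mkdir line assumes the cache directory may not exist yet. Recipe lines must start with a tab.

```makefile
# Hypothetical list of the specific raw files the cleaning step reads.
RAW := data/a.csv data/b.csv

.PHONY: all
all: cache/clean_data-stamp

# make only compares the modification times of $(RAW) with that of the
# stamp; the only files written here are under cache/.
cache/clean_data-stamp: $(RAW)
	mkdir -p cache
	scripts/clean_data
	touch $@
```

With this layout, running make a second time without changing either raw file should do nothing, and nothing under data/ is ever written by make itself.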
Note that this regenerates all clean_data files if one data file changes. A more elaborate approach is possible if you have a 1-to-1 mapping between data and cleaned files. The GNU Make Manual has a decent example of this. Here is an adaptation:
DATAFILES:= $(wildcard data/*)
CACHEFILES:= $(patsubst data/%,cache/clean_data/%,$(DATAFILES))
cache/clean_data/% : data/%
	scripts/clean_data --input $< --output $@
all: $(CACHEFILES)
Here, we use wildcard to get a list of all files under data. Then we replace the data path with the cache path using patsubst. We tell make how to generate each cache file from its data file via a pattern rule, and finally we define a target all which generates all the required cache files.
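For comparison, GNU Make also offers static pattern rules, which apply a pattern only to an explicitly listed set of targets. The following is a sketch of that variant, not a drop-in replacement: it assumes data/ contains at least one file, that the cleaning script accepts the --input and --output options shown above, and that the cache/clean_data directory may need to be created first. The .PHONY declaration and the mkdir line are additions for illustration.

```makefile
DATAFILES  := $(wildcard data/*)
CACHEFILES := $(patsubst data/%,cache/clean_data/%,$(DATAFILES))

# all is not a real file, so mark it phony; as the first target it is
# also the default goal.
.PHONY: all
all: $(CACHEFILES)

# Static pattern rule syntax: targets : target-pattern : prereq-pattern
# $< is the matching data file, $@ the cache file, $(@D) its directory.
$(CACHEFILES): cache/clean_data/%: data/%
	mkdir -p $(@D)
	scripts/clean_data --input $< --output $@
```

Because the rule names its targets explicitly, it applies only to the files in CACHEFILES, and only the cache files whose data file is newer get rebuilt.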
Of course you can also list your CACHEFILES explicitly in the Makefile (CACHEFILES:= cache/clean_data/a cache/clean_data/b), but it is typically more convenient to let make handle that automatically, if possible.
Note that this more elaborate example probably only works with GNU Make, not with Windows' nmake. For further information, consult the GNU Make Manual; it is a great resource for all your Makefile needs.