I'm using make to control the data flow in a statistical analysis. I have my raw data in a directory ./data/raw_data_files, and I've got a data manipulation script that creates a cleaned data cache at ./cache/clean_data. The make rule is something like:
cache/clean_data:
	scripts/clean_data
I do not want to touch the data in ./data/, either with make or with any of my data munging scripts. Is there any way in make to create a dependency for cache/clean_data that just checks whether specific files in ./data/ are newer than the last time make ran?
If clean_data is a single file, just let it depend on all data files.

If it is a directory containing multiple cleaned files, the easiest way is to write a stamp file and have that depend on your data files.
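As a rough sketch of the two cases above: the variable name RAWFILES and the stamp file name cache/clean_data.stamp are illustrative choices, not something from the question, and the raw files are assumed to sit directly in data/raw_data_files. Recipe lines must start with a tab. For the single-file case:

```make
# Assumption: all raw inputs sit directly in data/raw_data_files.
RAWFILES := $(wildcard data/raw_data_files/*)

# Rebuild the cache whenever any raw file is newer than cache/clean_data.
cache/clean_data: $(RAWFILES)
	scripts/clean_data
```

For the directory case, a stamp file variant might look like this:

```make
# Assumption: all raw inputs sit directly in data/raw_data_files.
RAWFILES := $(wildcard data/raw_data_files/*)

# cache/clean_data.stamp is a hypothetical marker file; its timestamp
# records when the cleaning script last completed successfully.
cache/clean_data.stamp: $(RAWFILES)
	scripts/clean_data
	mkdir -p $(@D)
	touch $@
```

In both variants make only reads the modification times of the files under data/, so nothing there is written to. Strictly speaking, make compares the data files against the target's timestamp rather than against the last time make ran, but since the target is refreshed whenever the rule runs, the effect is the same here.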
Note that this regenerates all clean_data files if one data file changes. A more elaborate approach is possible if you have a 1-to-1 mapping between data and cleaned files. The GNU Make Manual has a decent example of this, and an adaptation works as follows. We use wildcard to get a list of all files under data. Then we replace the data path with the cache path using patsubst. We tell make how to generate cache files via a static pattern rule, and finally, we define a target all which generates all the required cache files.

Of course you can also list your CACHEFILES explicitly in the Makefile (CACHEFILES := cache/clean_data/a cache/clean_data/b), but it is typically more convenient to let make handle that automatically, if possible.

Notice that this more complex example probably only works with GNU Make, not with Windows' nmake. For further info, consult the GNU Make Manual; it is a great resource for all your Makefile needs.
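The static pattern rule approach described above could look roughly like the following sketch. It makes several assumptions: RAWDIR, CACHEDIR and DATAFILES are illustrative names, the raw files are taken to sit in data/raw_data_files (set RAWDIR to data if they live directly under data), and scripts/clean_data is assumed to accept an input path and an output path, which the question does not confirm.

```make
# Assumed locations; adjust to your layout.
RAWDIR   := data/raw_data_files
CACHEDIR := cache/clean_data

# List every raw file, then map each one to its cache counterpart.
DATAFILES  := $(wildcard $(RAWDIR)/*)
CACHEFILES := $(patsubst $(RAWDIR)/%,$(CACHEDIR)/%,$(DATAFILES))

# Default target: build every required cache file.
.PHONY: all
all: $(CACHEFILES)

# Static pattern rule: each cache file depends only on its own raw file.
# Hypothetical interface: scripts/clean_data <input> <output>.
$(CACHEFILES): $(CACHEDIR)/%: $(RAWDIR)/%
	mkdir -p $(@D)
	scripts/clean_data $< $@
```

With this layout, running make after changing a single raw file should rerun the script for that file only, and the raw files themselves are only ever read.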