Is there a predefined location where an R package could store cached data? The data should persist across sessions. I was thinking about creating a subdirectory of ${R_LIBS_USER}/package_name
, but I'm not sure if this is portable and if this is "allowed" if my package is installed systemwide.
The idea is the following: Create an R script mydata.R
in the data
subdirectory of the package which would be executed by calling data(mydata)
(according to the documentation of data()
). This script would load the data from the internet and cache it, if it hasn't been cached before. (If the data has been cached already, the cache will be used.) In addition, a function will be provided to invalidate the cache and/or to check if a newer version of the data is available online.
This is from the documentation of data()
:
Currently, four formats of data files are supported:
files ending ‘.R’ or ‘.r’ are source()d in, with the R working directory changed temporarily to the directory containing the respective file. (data ensures that the utils package is attached, in case it had been run via utils::data.)
...
Indeed, creating a file fortytwo.R
in the data
subdirectory of a package with the following contents:
fortytwo = data.frame(answer=42)
and then executing data(fortytwo)
creates a data frame variable fortytwo
. Now the question is: Where would fortytwo.R
cache the data if it were difficult to compute?
EDIT: I am thinking about creating two packages: A "data" package that provides the data, and a "code" package that operates on it. The question concerns the "data" package: Where can it store files in a per-user storage so that it is persistent across R sessions and is accessible from different R projects?
Related: Package that downloads data from the internet during installation.
Have you looked at in-memory databases? H2 & Redis have bindings in R via RH2 & rredis- both allow you to share the data across r sessions- till the creating session is alive. in order to have it persisting across non-concurrent sessions, you need to write your data to the disk (assuming you can't re-create it on the fly- that would defeat the purpose of this question), and I believe the data package would be a good option. That way, you could add an update function that initializes everytime you load either package (i.e. if the code package has the right dependencies)
An example is RWeka & RWekaJars packages. Look them up on CRAN, and it should be fairly easy to understand how they work.
There is no absolutely defined location for package-specific persistent caching in R. However, the R.cache package provides an interface for creating and managing cached data. It looks like it could be useful for your scenario.
When users load R.cache (
library(R.cache)
), they get the following prompt:They can then choose to create the cache directory in their home directory, which is presumably persistent, or to create a session-specific directory. If you make your data package depend on R.cache, you could check for the existence of the cached object(s) in its
.onLoad()
hook function and download the data if it isn't there. Alternatively, you could do this in the way suggested in your own question.