I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.
Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?
As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).
It looks like the options are:
Hashing
- digest - provides hashing for arbitrary R objects.
Memoization
- memoise - a very simple tool for memoization of functions.
- R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
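For orientation, here is a minimal sketch of how memoise is typically used, assuming the package is installed (the block is skipped otherwise). The function slow_square is a hypothetical stand-in, not something from either package.

```r
# Hypothetical stand-in for an expensive calculation.
slow_square <- function(x) {
  Sys.sleep(0.2)  # pretend this takes a long time
  x^2
}

if (requireNamespace("memoise", quietly = TRUE)) {
  # memoise() returns a wrapped function that caches results by argument.
  fast_square <- memoise::memoise(slow_square)

  first  <- system.time(fast_square(4))[["elapsed"]]  # computed
  second <- system.time(fast_square(4))[["elapsed"]]  # served from the cache
  cat("first call:", first, "s; second call:", second, "s\n")

  # forget() clears the cache attached to the memoised function.
  memoise::forget(fast_square)
}
```

By default memoise keeps its cache in memory for the current session, whereas R.cache writes results to files, which is one practical difference between the two.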
Caching
- hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
Key/value storage
These are basic options for external storage of R objects.
Checkpointing
- cacher - this seems to be more akin to checkpointing.
- CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
- DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
Other
- Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
- The data.table package supports rapid lookups of elements in a data table.
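Of the base R options listed, an environment is probably the closest thing to a Perl hash or Python dictionary, since it is mutable and looked up by name. A minimal sketch using only base R (the keys and values are made up for illustration):

```r
# An environment used as a mutable dictionary. parent = emptyenv() keeps
# lookups from falling through to enclosing environments.
d <- new.env(hash = TRUE, parent = emptyenv())

# Insert or update
assign("apple", 3, envir = d)
d[["banana"]] <- 5

# Membership test and lookup
has_apple  <- exists("apple", envir = d, inherits = FALSE)
has_cherry <- exists("cherry", envir = d, inherits = FALSE)
banana     <- get("banana", envir = d)

# Iterate over keys
for (k in sort(ls(d))) cat(k, "=", d[[k]], "\n")

# Delete a key
rm("apple", envir = d)
keys_left <- ls(d)
```

Unlike a list, the environment is modified in place, so it can be passed to a function and updated there without copying.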
Use case
Although I'm mostly interested in knowing the options, I have two basic use cases that arise:
- Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
- Memoization of monstrous calculations.
These really arise because I'm digging in to the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
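To make that last idea concrete, here is a rough base R sketch of a hand-rolled memoizer that also counts cache hits and misses, so one can see whether memoization would pay off before committing to a package. The names memoize_counting and slow_sum are hypothetical, and building the key with deparse() is an assumption that only suits simple arguments; a digest of the arguments would be the more robust choice.

```r
# Wrap a function so results are cached in an environment, keyed by a string
# built from the arguments. Hit and miss counts are recorded along the way.
memoize_counting <- function(f) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  stats <- c(hits = 0, misses = 0)

  wrapped <- function(...) {
    # Assumption: deparse() yields a usable, unique key for simple arguments.
    key <- paste(deparse(list(...)), collapse = "")
    if (exists(key, envir = cache, inherits = FALSE)) {
      stats[["hits"]] <<- stats[["hits"]] + 1
      return(get(key, envir = cache))
    }
    stats[["misses"]] <<- stats[["misses"]] + 1
    value <- f(...)
    assign(key, value, envir = cache)
    value
  }

  # Accessor for the current counts.
  attr(wrapped, "stats") <- function() stats
  wrapped
}

# Hypothetical slow function and a workload with repeated inputs.
slow_sum <- function(a, b) {
  Sys.sleep(0.05)
  a + b
}
m <- memoize_counting(slow_sum)
inputs <- c(1, 2, 1, 3, 1, 2)
results <- vapply(inputs, function(x) m(x, 10), numeric(1))
usage <- attr(m, "stats")()
print(usage)
```

A high ratio of hits to misses on real inputs suggests memoization is worth it; if nearly every call is a miss, caching only adds overhead.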
Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.
Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)
- Dirk Eddelbuettel: digest - a lot of other packages depend on this.
- Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
- Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
- Henrik Bengtsson: R.cache & Hadley Wickham: memoise - it's not yet clear when to prefer one package over the other.
Note 3: Some people use memoise/memoisation; others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".
Related to @biocyperman's solution: R.cache has a wrapper function that takes care of loading, saving and evaluating the cache. See the modified function:
R.cache provides a wrapper for loading, evaluating and saving. You can simplify your code like this:
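The code from these answers is not reproduced here. As a rough sketch of the idea, and assuming the wrapper meant is R.cache's evalWithMemoization() (memoizedCall() and addMemoization() are related functions in the same package), usage might look like this. The cache directory name and the computation are made up, and the block is skipped if R.cache is not installed.

```r
if (requireNamespace("R.cache", quietly = TRUE)) {
  # Keep this example's cache files in a temporary directory
  # (the directory name is hypothetical).
  cache_dir <- file.path(tempdir(), "rcache-demo")
  dir.create(cache_dir, showWarnings = FALSE)
  R.cache::setCacheRootPath(cache_dir)

  for (i in 1:2) {
    start <- Sys.time()
    # The expression is evaluated on the first pass; on later passes the
    # variables it assigned are restored from the file cache instead.
    R.cache::evalWithMemoization({
      Sys.sleep(0.5)          # stand-in for a slow computation
      total <- sum(1:1000)
    })
    elapsed <- as.numeric(difftime(Sys.time(), start, units = "secs"))
    cat("run", i, "took", round(elapsed, 2), "s; total =", total, "\n")
  }
}
```

The point of the wrapper is that the explicit loadCache()/saveCache() bookkeeping disappears: the expression itself serves as the cache key.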
I did not have luck with memoise because it gave a "too deep recursive" problem with some functions of a package I tried it with. With R.cache I had better luck. Following is more annotated code I adapted from the R.cache documentation. The code shows different options to do caching.

For simple counting of strings (and not using table or similar), a multiset data structure seems like a good fit. The environment object can be used to emulate this.
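A minimal base R sketch of that idea, counting strings one at a time without holding the whole input in memory. The helper add_string and the sample input are hypothetical, and note that an empty string cannot be used as a key in an environment.

```r
# The environment acts as the multiset: keys are strings, values are counts.
counts <- new.env(hash = TRUE, parent = emptyenv())

# Increment the count for one string (assumes s is a non-empty string).
add_string <- function(s, env = counts) {
  current <- if (exists(s, envir = env, inherits = FALSE)) env[[s]] else 0L
  env[[s]] <- current + 1L
  invisible(env)
}

# Hypothetical input; in practice this could be lines read from a connection.
incoming <- c("foo", "bar", "foo", "baz", "foo", "bar")
for (s in incoming) add_string(s)

# Collect the counts into a named integer vector, sorted by key.
keys <- sort(ls(counts, all.names = TRUE))
result <- vapply(keys, function(k) counts[[k]], integer(1))
print(result)
```

Because environments have reference semantics, add_string updates counts in place, much like incrementing $count{$s} in Perl.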