I am writing a manual computation memoization system (ugh, in Matlab). The straightforward parts are easy:
- A way to put data into the memoization system after performing a computation.
- A way to query and get data out of the memoization.
- A way to query the system for all the 'keys'.
These parts are not so much in doubt. The problem is that my computer has a finite amount of memory, so sometime the 'put' operation will have to dump some objects out of memory. I am worried about 'cache misses', so I would like some relatively simple system for dropping the memoized objects which are not used frequently and/or are not costly to recompute. How do I design that system? The parts I can imagine it having:
- A way to tell the 'put' operation how costly (relatively speaking) the computation was.
- A way to optionally hint to the 'put' operation how often the computation might be needed (going forward).
- The 'get' operation would want to note how often a given object is queried.
- A way to tell the whole system the maximum amount of memory to use (ok, that's simple).
The real guts of it would be in the 'put' operation when you hit the memory limit and it has to cull some objects, based on their memory footprint, their costliness, and their usefulness. How do I do that?
Sorry if this is too vague, or off-topic.
I'd do it by creating a subclass to DYNAMICPROPS that uses a cell array to store the data internally. This way, you can dynamically add more data to the object.
Here's the basic design idea:
The data is stored in a cell array. Each property gets its own row, with the first column being the property name (for convenience), the second column a function handle to calculate the data, the third column the data, the fourth column the time it took to generate the data, the fifth column an array of, say, length 100 storing the timestamps corresponding to when the property was accessed the last 100 times, and the sixth column contains the variable size.
There is a generic get method that takes as input the row number corresponding to the property (see below). The get method first checks whether column 3 is empty. If no, it returns the value and stores the timestamp. If yes, it performs the computation using the handle from column 1 inside a TIC/TOC statement to measure how expensive the computation is (which is stored in col4, and the timestamp is stored in col5). Then it checks whether there is enough space for storing the result. If yes, it stores the data, otherwise it checks size, as well as the product of how many times data were accessed with how long it would take to regenerate, to decide what to cull.
In addition, there is an 'add' property, that adds a row to the cell array, creates a dynamic property (using addprops
) of the same name as the function handle, and sets the get-method to myGetMethod(myPropertyIndex)
. If you need to pass parameters to the function, you can create an additional property myDynamicPropertyName_parameters
with a set method that will remove previously calculated data whenever the parameters change value.
Finally, you can add a few dependent properties, that can tell how many properties there are (# of rows in the cell array), how they're called (first col of the cell array), etc.
Consider using Java, since MATLAB runs on top of it, and can access it. This would work if you have marshallable values (ints, doubles, strings, matrices, but not structs or cell arrays).
LRU containers are available for Java:
- Easy, simple to use LRU cache in java
- http://java-planet.blogspot.com/2005/08/how-to-set-up-simple-lru-cache-using.html
- http://www.codeproject.com/KB/java/lru.aspx