Caching Matlab function results to file

2019-02-24 22:53发布

问题:

I'm writing a simulation in Matlab. I will eventually run this simulation hundreds of times. In each simulation run, there are millions of simulation cycles. In each of these cycles, I calculate a very complex function, which takes ~0.5 sec to finish. The function input is a long bit array (>1000 bits) - which is an array of 0 and 1. I hold the bit arrays in a matrix of 0 and 1, and for each one of them I only run the function once - as I save the result in a different array (res) and check if the bit array is in the matrix before running the functions:

for i=1:1000000000
    %pick a bit array somehow
    [~,indx] = ismember(bit_array,bit_matrix,'rows');
    if indx == 0
        indx = length(results) + 1;
        bit_matrix(indx,:) = bit_array;
        res(indx) = complex_function(bit_array);
    end
    result = res(indx)
    %do something with result
end

I have two quesitons, really:

  1. Is there a more efficient way to find the index of a row in a matrix then 'ismember'?

  2. Since I run the simulation many times, and there is a big overlap in the bit-arrays I'm getting, I want to cache the matrix between runs so that I don't recalculate the function over the same bit-arrays over and over again. How do I do that?

回答1:

The answer to both questions is to use a map. There are a few steps to do this.

  1. First you will need a function to turn your bit_array into either a number or a string. For example, turn [0 1 1 0 1 0] into '011010'. (Matlab only supports scalar or string keys, which is why this step is required.)

  2. Defined a map object

    cachedRunMap = containers.Map;  %See edit below for more on this
    
  3. To check if a particular case has been run, use iskey.

    cachedRunMap.isKey('011010');
    
  4. To add the results of a run use the appending syntax

    cachedRunMap('011010') = [0 1 1 0 1];  %Or whatever your result is.  
    
  5. To retrieve cached results, use the getting syntax

    tmpResult = cachedRunMap.values({'011010'});
    

This should efficiently store and retrieve values until you run out of system memory.


Putting this together, now your code would look like this:

%Hacky magic function to convert an array into a string of '0' and '1'
strFromBits = @(x) char((x(:)'~=0)+48); %'

%Initialize the map
cachedRunMap = containers.Map;

%Loop, computing and storing results as needed
for i=1:1000000000
    %pick a bit array somehow
    strKey = strFromBits(bit_array);
    if cachedRunMap.isKey(strKey)
        result = cachedRunMap(strKey);
    else
        result = complex_function(bit_array);
        cachedRunMap(strKey) = reult;
    end
    %do something with result
end

If you want a key which is not a string, that needs to be declared at step 2. Some examples are:

cachedRunMap = containers.Map('KeyType', 'char', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'double', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'any');
cachedRunMap = containers.Map('KeyType', 'uint64', 'ValueType', 'double');

Setting a KeyType of 'char' sets the map to use strings as keys. All other types must be scalars.


Regarding issues as you scale this up (per your recent comments)

  • Saving data between sessions: There should be no issues saving this map to a *.mat file, up to the limits of your systems memory

  • Purging old data: I am not aware of a straightforward way to add LRU features to this map. If you can find a Java implementation you can use it within Matlab pretty easily. Otherwise it would take some thought to determine the most efficient method of keeping track of the last time a key was used.

  • Sharing data between concurrent sessions: As you indicated, this probably requires a database to perform efficiently. The DB table would be two columns (3 if you want to implement LRU features), the key, value, (and last used time if desired). If your "result" is not a type which easily fits into SQL (e.g. a non-uniform size array, or complex structure) then you will need to put additional thought into how to store it. You will also need a method to access the database (e.g. the database toolbox, or various tools on the Mathworks file exchange). Finally you will need to actually setup a database on a server (e.g. MySql if you are cheap, like me, or whatever you have the most experience with, or can find the most help with.) This is not actually that hard, but it takes a bit of time and effort the first time through.

    Another approach to consider (much less efficient, but not requiring a database) would be to break up the data store into a large (e.g. 1000's or millions) number of maps. Save each into a separate *.mat file, with a filename based on the keys contained in that map (e.g. the first N characters of your string key), and then load/save these files between sessions as needed. This will be pretty slow ... depending on your usage it may be faster to recalculate from the source function each time ... but it's the best way I can think of without setting up the DB (clearly a better answer).



回答2:

  1. For a large list, a hand-coded binary search can beat ismember, if maintaining it in sorted order isn't too expensive. If that's really your bottleneck. Use the profiler to see how much the ismember is really costing you. If there aren't too many distinct values, you could also store them in a containers.Map by packing the bit_matrix in to a char array and using it as the key.

  2. If it's small enough to fit in memory, you could store it in a MAT file using save and load. They can store any basic Matlab datatype. Have the simulation save the accumulated res and bit_matrix at the end of its run, and re-load them the next time it's called.



回答3:

I think that you should use containers.Map() for the purpose of speedup.

The general idea is to hold a map that contains all hash values. If your bit arrays have uniform distribution under the hash function, most of the time you won't need the call to ismember.

Since the key type cannot be an array in Matlab, you can calculate some hash function on your array of bits.

For example:

 function s = GetHash(bitArray)
      s = mod( sum(bitArray), intmax('uint32'));          
 end

This is a lousy hash function, but enough to understand the principle. Then the code would look like:

map = containers.Map('KeyType','uint32','ValueType','any');
for i=1:1000000000
    %pick a bit array somehow
    s = GetHash(bit_array);   
    if isKey  %Do the slow check.
        [~,indx] = ismember(bit_array,bit_matrix,'rows');
    else
       map(s) = 1;
       continue;
    end
    if indx == 0
        indx = length(results) + 1;
        bit_matrix(indx,:) = bit_array;
        res(indx) = complex_function(bit_array);
    end
    result = res(indx)
    %do something with result
end