First question here. I'll try to be concise.
I am generating multiple arrays containing feature information for a machine learning application. As the arrays do not have equal dimensions, I store them in a dictionary rather than in an array. There are two different kinds of features, so I am using two different dictionaries.
I also generate labels to go with the features. These labels are stored in arrays. Additionally, there are strings containing the exact parameters used for running the script and a timestamp.
All in all it looks like this:
import numpy as np
feature1 = {}
feature2 = {}
label1 = np.array([])
label2 = np.array([])
docString = 'Commands passed to the script were...'
# features look like this:
feature1 = {'case 1': np.array([1, 2, 3, ...]),
            'case 2': np.array([2, 1, 3, ...]),
            'case 3': np.array([2, 3, 1, ...]),
            # and so on...
            }
Now my goal would be to do this:
np.savez(outputFile,
         saveFeature1 = feature1,
         saveFeature2 = feature2,
         saveLabel1 = label1,
         saveLabel2 = label2,
         saveString = docString)
This seemingly works (i.e. such a file is saved with no error thrown and can be loaded again). However, when I try to load for example the feature from the file again:
loadedArchive = np.load(outputFile)
loadedFeature1 = loadedArchive['saveFeature1']
loadedString = loadedArchive['saveString']
Then instead of getting a dictionary back, I get a numpy array of shape (0) where I don't know how to access the contents:
In []: loadedFeature1
Out[]:
array({'case 1': array([1, 2, 3, ...]),
'case 2': array([2, 3, 1, ...]),
..., }, dtype=object)
Also strings become arrays and get a strange datatype:
In []: loadedString.dtype
Out[]: dtype('|S20')
So in short, I am assuming this is not how it is done correctly. However I would prefer not to put all variables into one big dictionary because I will retrieve them in another process and would like to just loop over the dictionary.keys() without worrying about string comparison.
Any ideas are greatly appreciated. Thanks
Put all your variables into an object and then use Pickle. It's a better way to store state information.
If you need to save your data in a structured way, you should consider using the HDF5 file format (http://www.hdfgroup.org/HDF5/). It is very flexible, easy to use, efficient, and other software might already support it (HDFView, Mathematica, Matlab, Origin...). There is a simple Python binding called h5py.
You can store datasets in a filesystem-like structure and define attributes for each dataset, like a dictionary. For example:
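The example that originally followed is not preserved here. As a rough sketch only, assuming h5py is installed, storing data shaped like the question's could look something like this. The file name, group name, and sample values are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical data mirroring the question: one array per case, unequal lengths.
feature1 = {'case 1': np.array([1, 2, 3]),
            'case 2': np.array([2, 1, 3, 4])}
label1 = np.array([0, 1])

# The path is an assumption made for this sketch.
h5_path = os.path.join(tempfile.mkdtemp(), 'features.h5')

with h5py.File(h5_path, 'w') as f:
    # Groups behave like directories, datasets like files inside them.
    group = f.create_group('feature1')
    for case, values in feature1.items():
        group.create_dataset(case, data=values)
    f.create_dataset('label1', data=label1)
    # Attributes hold small metadata, accessed like a dictionary.
    f.attrs['docString'] = 'Commands passed to the script were...'
```

Each case becomes its own dataset, so the arrays do not need to share a shape.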
Reading the data is also simple; you can even load just a few elements out of a large file if you want:
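The original reading example is missing as well. A minimal sketch, again assuming h5py is available and using a made-up file and dataset name, might be:

```python
import os
import tempfile

import h5py
import numpy as np

# Write a small file first so the sketch is self-contained (names are hypothetical).
h5_path = os.path.join(tempfile.mkdtemp(), 'big.h5')
with h5py.File(h5_path, 'w') as f:
    f.create_dataset('big', data=np.arange(1000))

with h5py.File(h5_path, 'r') as f:
    dataset = f['big']         # a handle only; no data has been read yet
    first_five = dataset[:5]   # reads just this slice into a NumPy array
    everything = dataset[...]  # reads the whole dataset into a NumPy array
```

Slicing the dataset object is what triggers the read, so only the requested elements are loaded.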
More features and possibilities are found in the documentation and on the websites (the Quick Start Guide might be of interest).
As @fraxel has already suggested, using pickle is a much better option in this case. Just save a dict with your items in it.
However, be sure to use pickle with a binary protocol. By default (on Python 2), it uses a less efficient text format, which will result in excessive memory usage and huge files if your arrays are large.
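As a small sketch of that suggestion, with made-up sample values and a temporary file path chosen just for illustration, saving everything in one dict with a binary protocol could look like this:

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-ins for the variables in the question.
state = {
    'feature1': {'case 1': np.array([1, 2, 3]),
                 'case 2': np.array([2, 1, 3, 4])},
    'label1': np.array([0, 1]),
    'docString': 'Commands passed to the script were...',
}

pickle_path = os.path.join(tempfile.mkdtemp(), 'features.pkl')

# Open in binary mode and ask for the most compact protocol available.
with open(pickle_path, 'wb') as f:
    pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(pickle_path, 'rb') as f:
    restored = pickle.load(f)
```

The nested dict comes back as a dict, so looping over restored['feature1'].keys() works without any unwrapping.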
That having been said, let's take a look at what's happening in more detail for illustrative purposes.
numpy.savez expects each item to be an array. In fact, it calls np.asarray on everything you pass in.
If you turn a dict into an array, you'll get an object array. Similarly, if you make an array out of a string, you'll get a string array.
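The conversions can be reproduced directly. The following is a small sketch with made-up sample values. It also shows the resulting shape and what happens when such an array is indexed, which is the point where the question ran into trouble.

```python
import numpy as np

# Hypothetical sample data.
feature1 = {'case 1': np.array([1, 2, 3])}

dict_arr = np.asarray(feature1)
print(dict_arr.dtype, dict_arr.shape)    # object dtype, empty shape tuple

str_arr = np.asarray('Commands passed to the script were...')
print(str_arr.dtype.kind, str_arr.shape)  # a string dtype, also 0-dimensional

# A 0-dimensional array has no axis to index along.
try:
    dict_arr[0]
    index_failed = False
except IndexError:
    index_failed = True

# Two ways to get the original dict back out.
recovered = dict_arr.reshape(-1)[0]
also_recovered = dict_arr.item()
```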
However, because of a quirk in the way object arrays are handled, if you pass in a single object (in your case, your dict) that isn't a tuple, list, or array, you'll get a 0-dimensional object array.
This means that you can't index it directly. In fact, doing testarr[0] on such an array will raise an IndexError. The data is still there, but you need to add a dimension first, so you have to do yourdictionary = testarr.reshape(-1)[0].
If all of this seems clunky, it's because it is. Object arrays are essentially always the wrong answer. (Although asarray should arguably pass in ndmin=1 to array, which would solve this particular problem, but potentially break other things.)
savez is intended to store arrays, rather than arbitrary objects. Because of the way it works, it can store completely arbitrary objects, but it shouldn't be used that way.
If you did want to use it, though, a quick workaround would be to do:
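The original snippet is not preserved here. One plausible form of the workaround, offered as an assumption and not necessarily what the answer showed, is to wrap each dict in a list so that it becomes a 1-dimensional object array. The sample values and path are made up.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for the question's variables.
feature1 = {'case 1': np.array([1, 2, 3]),
            'case 2': np.array([2, 1, 3, 4])}
label1 = np.array([0, 1])

npz_path = os.path.join(tempfile.mkdtemp(), 'features.npz')

# Wrapping the dict in a list gives an object array of shape (1,)
# instead of a 0-dimensional one.
np.savez(npz_path,
         saveFeature1=[feature1],
         saveLabel1=label1)
```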
And you'd then access things with
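The access snippet is also missing. Assuming the dict was wrapped in a list before saving (a guess at the workaround, not confirmed by the text), reading it back might look like the sketch below. Recent NumPy versions refuse to load object arrays unless allow_pickle=True is passed to np.load.

```python
import os
import tempfile

import numpy as np

# Save first so the sketch is self-contained (names and values are hypothetical).
feature1 = {'case 1': np.array([1, 2, 3])}
label1 = np.array([0, 1])
npz_path = os.path.join(tempfile.mkdtemp(), 'features.npz')
np.savez(npz_path, saveFeature1=[feature1], saveLabel1=label1)

# Object arrays are stored via pickle, so loading them must be allowed explicitly.
with np.load(npz_path, allow_pickle=True) as archive:
    loaded_feature1 = archive['saveFeature1'][0]  # index 0 unwraps the dict
    loaded_label1 = archive['saveLabel1']
```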
However, this is clearly much more clunky than just using pickle. Use numpy.savez when you're just saving arrays. In this case, you're saving nested data structures, not arrays.