I'm working on a project, written in Java, which requires that I build a very large 2-D sparse array. Very sparse, if that makes a difference. Anyway: the most crucial aspect for this application is efficency in terms of time (assume loads of memory, though not nearly so unlimited as to allow me to use a standard 2-D array -- the key range is in the billions in both dimensions).
Out of the kajillion cells in the array, there will be several hundred thousand cells which contain an object. I need to be able to modify cell contents VERY quickly.
Anyway: Does anyone know a particularly good library for this purpose? It would have to be Berkeley, LGPL or similar license (no GPL, as the product can't be entirely open-sourced). Or if there's just a very simple way to make a homebrew sparse array object, that'd be fine too.
I'm considering MTJ, but haven't heard any opinions on its quality.
Not the fastest runtime solution probably, but the fastest I could come up with that seems to work. Create an Index class and use it as a key for a SortedMap, like:
My Index class looks like this (with some help from Eclipse code generator).
you could just use a nested map although if you need to do matrix calculus on it that might not be the best option
maybe instead of object use some tuple for the actual data so you can work with it easier after extraction, something like:
null check etc omitted for brevity
You migth look at la4j (Linear Algebra for Java) library. It supports CRS (Compressed Row Storage) as well as CCS (Compressed Column Storage) internal representaions for sparse matrices. So, these are the most efficient and fast internal stuctures for sparse data.
Here is the brief example of using sparse matrices in la4j:
HashMap rocks. Just concatenate the indexes (as strings) with a separator, say '/', using StringBuilder (not + or String.format), and use that as the key. You can't get faster and more memory-efficient than that. Sparse matrices are soo 20th century. :-)
Sparsed arrays built with hashmaps are very inefficient for frequently read data. The most efficient implementations uses a Trie that allows access to a single vector where segments are distributed.
A Trie can compute if an element is present in the table by performing only read-only TWO array indexing to get the effective position where an element is stored, or to know if its absent from the underlying store.
It can also provide a default position in the backing store for the default value of the sparsed array, so that you don't need ANY test on the returned index, because the Trie guarantees that all possible source index will map at least to the default position in the backing store (where you'll frequently store a zero, or an empty string or a null object).
There exists implementations that support fast-updatable Tries, with an otional "compact()" operation to optimze the size of the backing store at end of multiple operations. Tries are MUCH faster than hashmaps, because they don't need any complex hashing function, and don't need to handle collisions for reads (With Hashmaps, you have collision BOTH for reading and for writing, this requires a loop to skip to the next candidate position, and a test on each of them to compare the effective source index...)
In addition, Java Hashmaps can only index on Objects, and creating an Integer object for each hashed source index (this object creation will be needed for every read, not just writes) is costly in terms of memory operations, as it stresses the garbage collector.
I really hoped that the JRE included an IntegerTrieMap<Object> as the default implementation for the slow HashMap<Integer, Object> or LongTrieMap<Object> as the default implementation for the even slower HashMap<Long, Object>... But this is still not the case.
You may wonder what is a Trie?
It's just a small array of integers (in a smaller range than the full range of coordinates for your matrix) that allows mapping the coordinates into an integer position in a vector.
For example suppose you want a 1024*1024 matrix containing only a few non zero values. Instead of storing that matrix in a array containing 1024*1024 elements (more than 1 million), you may just want to split it into subranges of size 16*16, and you'll just need 64*64 such subranges.
In that case, the Trie index will contain just 64*64 integers (4096), and there will be at least 16*16 data elements (containing the default zeroes, or the most common subrange found in your sparse matrix).
And the vector used to store the values will contain only 1 copy for subranges that are equal with each other (most of them being full of zeroes, they will be represented by the same subrange).
So instead of using a syntax like
matrix[i][j]
, you'd use a syntax like:which will be more conveniently handled using an access method for the trie object.
Here is an example, built into a commented class (I hope it compiles OK, as it was simplified; signal me if there are errors to correct):
Note: this code is not complete because it handles a single matrix size, and its compactor is limited to detect only common subranges, without interleaving them.
Also, the code does not determine where it is the best width or height to use for splitting the matrix into subranges (for x or y coordinates), according to the matrix size. It just uses the same static subrange sizes of 16 (for both coordinates), but it could be conveniently any other small power of 2 (but a non power of 2 would slow down the
int indexOf(int, int)
andint offsetOf(int, int)
internal methods), independantly for both coordinates, and up to the maximum width or height of the matrix.ideally thecompact()
method should be able to determine the best fitting sizes.If these splitting subranges sizes can vary, then there will be a need to add instance members for these subrange sizes instead of the static
SUBRANGE_POSITIONS
, and to make the static methodsint subrangeOf(int i, int j)
andint positionOffsetOf(int i, int j)
into non static; and the initialization arraysDEFAULT_POSITIONS
andDEFAULT_VALUES
will need to be dropped or redefined differently.If you want to support interleaving, basically you'll start by dividing the existing values in two of about the same size (both being a multiple of the minimum subrange size, the first subset possibly having one more subrange than the second one), and you'll scan the larger one at all successive positions to find a matching interleaving; then you'll try to match these values. Then you'll loop recursely by dividing the subsets in halves (also a multiple of the minimum subrange size) and you'll scan again to match these subsets (this will multiply the number of subsets by 2: you have to wonder if the doubled size of the subrangePositions index is worth the value compared to the existing size of the values to see if it offers an effective compression (if not, you stop there: you have found the optimum subrange size directly from the interleaving compression process). In that case; the subrange size will be mutable, during compaction.
But this code shows how you assign non-zero values and reallocate the
data
array for additional (non-zero) subranges, and then how you can optimize (withcompact()
after assignments have been performed using thesetAt(int i, int j, double value)
method) the storage of this data when there are duplicate subranges that may be unified within the data, and reindexed at the same position in thesubrangePositions
array.Anyway, all the principles of a trie are implemented there:
It is always faster (and more compact in memory, meaning better locality) to represent a matrix using a single vector instead of a double-indexed array of arrays (each one allocated separately). The improvement is visible in the
double getAt(int, int)
method !You save a lot of space, but when assigning values it may take time to reallocate new subranges. For this reason, the subranges should not be too small or reallocations will occur too frequently for setting up your matrix.
It is possible to transform an initial large matrix automatically into a more compact matrix by detecting common subranges. A typical implementation will then contain a method such as
compact()
above. However, if get() access is very fast and set() is quite fast, compact() may be very slow if there are lots of common subranges to compress (for example when substracting a large non-sparse randomly-filled matrix with itself, or multiplying it by zero: it will be simpler and much faster in that case to reset the trie by instanciating a new one and dropping the old one).Common subranges use common storage in the data, so this shared data must be read-only. If you must change a single value without changing the rest of the matrix, you must first make sure that it is referenced only one time in the
subrangePositions
index. Otherwise you'll need to allocate a new subrange anywhere (conveniently at end) of thevalues
vector, and then store the position of this new subrange into thesubrangePositions
index.Note that the generic Colt library, though very good as it is, is not as good when working on sparse matrice, because it uses hashing (or row-compresssed) technics which do not implement the support for tries for now, despite it is an excellent optimization, which is both space-saving and time-saving, notably for the most frequent getAt() operations.
Even the setAt() operation described here for tries saves lot of time (the way is is implemented here, i.e. without automatic compaction after setting, which could still be implemented based on demand and estimated time where compaction would still save lot of storage space at the price of time): the time saving is proportional to the number of cells in subranges, and space saving is inversely proportional to the number of cells per subrange. A good compromize if then to use a subrange size such the number of cells per subrange is the square root of the total number of cells in a 2D matrix (it would be a cubic root when working with 3D matrice).
Hashing technics used in Colt sparse matrix implementations have the inconvenience that they add a lot of storage overhead, and slow access time due to possible collisions. Tries can avoid all collisions, and can then warranty to save linear O(n) time to O(1) time in the worst cases, where (n) is the number of possible collisions (which, in case of sparse matrix, may be up to the number of non-default-value cells in the matrix, i.e. up to the total number of size of the matrix times a factor proportional to the hashing filling factor, for a non-sparse i.e. full matrix).
The RC (row-compressed) technics used in Colt are nearer from Tries, but this is at another price, here the compression technics used, that have very slow access time for the most frequent read-only get() operations, and very slow compression for setAt() operations. In addition, the compression used is not orthogonal, unlike in this presentation of Tries where orthogonality is preserved. Tries would also be preserve this orthogonality for related viewing operations such as striding, transposition (viewed as a striding operation based on integer cyclic modular operations), subranging (and subselections in general, including with sorting views).
I just hope that Colt will be updated in some future to implement another implementation using Tries (i.e. TrieSparseMatrix instead of just HashSparseMatrix and RCSparseMatrix). The ideas are in this article.
The Trove implementation (based in int->int maps) are also based on hashing technics similar to the Colt's HashedSparseMatrix, i.e. they have the same inconvenience. Tries will be a lot faster, with a moderate additional space consumed (but this space can be optimized and become even better than Trove and Colt, in a deferred time, using a final compact()ion operation on the resulting matrix/trie).
Note: this Trie implementation is bound to a specific native type (here double). This is voluntary, because generic implementation using boxing types have a huge space overhead (and are much slower in acccess time). Here it just uses native unidimensional arrays of double rather than generic Vector. But it is certainly possible to derive a generic implementation as well for Tries... Unfortunately, Java still does not allow writing really generic classes with all the benefits of native types, except by writing multiple implementations (for a generic Object type or for each native type), and serving all these operation via a type factory. The language should be able to automatically instanciate the native implementations and build the factory automatically (for now it is not the case even in Java 7, and this is something where .Net still maintains its advantage for really generic types that are as fast as native types).
This seems to be simple.
You could use a binary tree of the data using row*maxcolums+column as an index.
To find item, you simply calculate row*maxcolums+column and binary search the tree looking for it, if it's not there, you can return null (it's О(log n) where n is the number of cells which contain an object).