Can anybody suggest an alternative to importing tens of gigabytes of numeric data (in .mx form) from a list of 60 .mx files, each about 650 MB?
The research problem (too large to post here) involved simple statistical operations on about twice as much data (around 34 GB) as the available RAM (16 GB). To handle the data size I simply split things up and used a Get / Clear strategy to do the math.
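A rough sketch of the kind of Get / Clear loop meant here (the file pattern, the symbol chunkData, and the statistic are made up for illustration):

    (* process the chunk files one at a time, keeping only the per-chunk results *)
    files = FileNames["chunk*.mx", "datadir"];   (* hypothetical chunk files *)
    results = Table[
       Get[file];                                (* defines, say, chunkData *)
       With[{stat = Mean[Flatten[chunkData]]},   (* some simple statistic *)
        Clear[chunkData];                        (* free memory before the next chunk *)
        stat],
       {file, files}];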
It does work, but calling Get["bigfile.mx"] takes quite some time, so I was wondering whether it would be quicker to use BLOBs or something similar with PostgreSQL, MySQL, or whatever database people use for gigabytes of numeric data.
So my question really is: What is the most efficient way to handle truly large data set imports in Mathematica?
I have not tried it yet, but I think that SQLImport from DatabaseLink will be slower than Get["bigfile.mx"].
Does anyone have some experience to share?
(Sorry if this is not a very specific programming question, but it would really help me move on from the time-consuming business of finding out which of the 137 possible ways of tackling a problem in Mathematica is the best.)
Here's an idea:
You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.
You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported. Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.
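A minimal version of that demonstration (the sizes and sublist lengths are made up; only the timing pattern matters):

    (* illustrative sizes, not the original data *)
    lengths = RandomInteger[{10, 1000}, 10^4];   (* hypothetical sublist lengths *)
    flat = RandomReal[1, Total[lengths]];        (* the flattened data; already packed *)
    Developer`PackedArrayQ[flat]                 (* True *)

    ends = Accumulate[lengths];
    starts = ends - lengths + 1;                 (* starting index of each sublist *)

    (* reconstruct the ragged matrix from the flat array *)
    ragged = MapThread[Take[flat, {#1, #2}] &, {starts, ends}]; // AbsoluteTiming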
Alternatively, store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel is going to add negligible overhead.

Importing packed arrays as MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but the import times are always a fraction of a second for packed arrays on my machine. This will solve the problem that importing data that is not packed can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
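If you want to check the MX timing claim on your own machine, here is one way to do it (the file name and array size are assumptions for illustration):

    (* write a packed array to an MX file, then time reading it back *)
    arr = RandomReal[1, 10^7];              (* roughly 80 MB of packed reals *)
    DumpSave["packedTest.mx", arr];         (* hypothetical file name *)
    Clear[arr];
    Get["packedTest.mx"]; // AbsoluteTiming (* restores arr *)
    Developer`PackedArrayQ[arr]             (* True: the data comes back packed *)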
If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need to break it into separate MX files. Then you could import just the relevant parts of the file: first make a test file, open it, skip the first 5 million entries, read 5 million entries, and finally read all the remaining entries.
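A minimal sketch of those steps (the file name test.bin, the "Real64" element type, and the sizes are assumptions; the skipping is done here with SetStreamPosition, which moves by bytes):

    (* make a test file of 30 million Real64 values (~240 MB) *)
    data = RandomReal[1, 30000000];
    Export["test.bin", data, "Real64"]; // AbsoluteTiming

    (* open it as a binary stream *)
    f = OpenRead["test.bin", BinaryFormat -> True];

    (* skip the first 5 million entries: 8 bytes per Real64 *)
    SetStreamPosition[f, 5000000*8]; // AbsoluteTiming

    (* read 5 million entries *)
    part1 = BinaryReadList[f, "Real64", 5000000]; // AbsoluteTiming

    (* read all the remaining entries *)
    part2 = BinaryReadList[f, "Real64"]; // AbsoluteTiming

    Close[f]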
(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP, by the way.)

EDIT: If you are willing to spend time on this and write some C code, another idea is to create a library function (using LibraryLink) that memory-maps the file and copies it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of LibraryLink).

I think the two best approaches are either:
1) use Get on the *.mx file,
2) or read the data in and save it in some binary format that you then read back via your own LibraryLink code. That, of course, has the disadvantage that you'd need to convert your MX files. But perhaps this is an option.
Generally speaking, Get with MX files is pretty fast.

Are you sure this is not a swapping problem?
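A quick in-kernel sanity check you can compare against the machine's physical RAM (these are standard functions, used here only as a diagnostic):

    MemoryInUse[]     (* bytes currently held by this kernel *)
    MaxMemoryUsed[]   (* peak bytes held during this session *)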
Edit 1: You could then also write an import converter: tutorial/DevelopingAnImportConverter
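For reference, a minimal sketch of what registering such a converter can look like, following that tutorial; the format name "RawReal64", the element name "Data", and the reader itself are made up for illustration:

    (* hypothetical reader: one flat block of Real64 values per binary file *)
    importRawReal64[filename_String, opts___] :=
      Module[{f = OpenRead[filename, BinaryFormat -> True], data},
        data = BinaryReadList[f, "Real64"];
        Close[f];
        {"Data" -> data}]

    ImportExport`RegisterImport["RawReal64", importRawReal64,
      "FunctionChannels" -> {"FileNames"}]

    (* afterwards the data can be read through the standard Import interface: *)
    (* Import["test.bin", {"RawReal64", "Data"}] *)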