Faster huge data-import than Get["raggedmatrix.mx"]?

Posted 2019-03-13 05:21

Can anybody suggest an alternative to importing a couple of GByte of numeric data (in .mx form) at a time from a list of 60 .mx files, each about 650 MByte?

The research problem - too large to post here - involved simple statistical operations on roughly twice as much data (around 34 GB) as there is RAM available (16 GB). To handle the data-size problem I just split things up and used a Get / Clear strategy to do the math.

It does work, but calling Get["bigfile.mx"] takes quite some time, so I was wondering if it would be quicker to use BLOBs or whatever with PostgreSQL or MySQL or whatever database people use for GB of numeric data.

So my question really is: What is the most efficient way to handle truly large data set imports in Mathematica?

I have not tried it yet, but I suspect that importing via SQL with DatabaseLink will be slower than Get["bigfile.mx"].

Does anyone have some experience to share?

(Sorry if this is not a very specific programming question, but it would really help me move past the time-consuming business of finding out which of the 137 possible ways to tackle a problem in Mathematica is the best.)

2 Answers
时光不老,我们不散
#2 · 2019-03-13 05:52

Here's an idea:

You said you have a ragged matrix, i.e. a list of lists of different lengths. I'm assuming floating point numbers.

You could flatten the matrix to get a single long packed 1D array (use Developer`ToPackedArray to pack it if necessary), and store the starting indexes of the sublists separately. Then reconstruct the ragged matrix after the data has been imported.
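For example, a minimal sketch of that flatten-and-record idea (the file name flat.mx and the variable names are illustrative):

(* flatten the ragged matrix and record the sublist lengths,
   so the structure can be rebuilt after import *)
ragged = {{1., 2.}, {3.}, {4., 5., 6.}};
flat = Developer`ToPackedArray@Flatten[ragged];
lens = Length /@ ragged;
DumpSave["flat.mx", {flat, lens}]; (* saves both symbols to MX *)
(* a later Get["flat.mx"] restores flat and lens, ready for repartitioning *)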


Here's a demonstration that within Mathematica (i.e. after import), extracting the sublists from a big flattened list is fast.

data = RandomReal[1, 10000000]; (* 10^7 reals; RandomReal returns a packed array *)

indexes = Union@RandomInteger[{1, 10000000}, 10000]; (* sorted, duplicate-free split points *)
ranges = #1 ;; (#2 - 1) & @@@ Partition[indexes, 2, 1]; (* spans between consecutive split points *)

data[[#]] & /@ ranges; // Timing

{0.093, Null}

Alternatively, store a sequence of sublist lengths and use Mr.Wizard's dynamicPartition function, which does exactly this. My point is that storing the data in a flat format and partitioning it in-kernel adds negligible overhead.
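For reference, a minimal partition-by-lengths function along those lines (my own sketch, not necessarily Mr.Wizard's exact implementation):

(* split a flat list into sublists of the given lengths *)
dynP[flat_, lens_] :=
 Module[{ends = Accumulate[lens]},
  MapThread[flat[[#1 ;; #2]] &, {ends - lens + 1, ends}]]

dynP[Range[6], {2, 3, 1}]

{{1, 2}, {3, 4, 5}, {6}}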


Importing packed arrays from MX files is very fast. I only have 2 GB of memory, so I cannot test on very large files, but on my machine the import times are always a fraction of a second for packed arrays. This avoids the problem that importing unpacked data can be slower (although, as I said in the comments on the main question, I cannot reproduce the kind of extreme slowness you mention).
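If you want to check how much packing matters on your machine, here is a quick illustrative comparison (file names are arbitrary):

packed = RandomReal[1, 10^6]; (* RandomReal yields a packed array *)
unpacked = Developer`FromPackedArray[packed];
DumpSave["packed.mx", packed];
DumpSave["unpacked.mx", unpacked];
Get["packed.mx"]; // Timing
Get["unpacked.mx"]; // Timing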


If BinaryReadList were fast (it isn't as fast as reading MX files now, but it looks like it will be significantly sped up in Mathematica 9), you could store the whole dataset as one big binary file, without the need to break it into separate MX files. Then you could import relevant parts of the file like this:

First make a test file:

In[3]:= f = OpenWrite["test.bin", BinaryFormat -> True]

In[4]:= BinaryWrite[f, RandomReal[1, 80000000], "Real64"]; // Timing
Out[4]= {9.547, Null}

In[5]:= Close[f]

Open it:

In[6]:= f = OpenRead["test.bin", BinaryFormat -> True]    

In[7]:= StreamPosition[f]

Out[7]= 0

Skip the first 5 million entries:

In[8]:= SetStreamPosition[f, 5000000*8]

Out[8]= 40000000

Read 5 million entries:

In[9]:= BinaryReadList[f, "Real64", 5000000] // Length // Timing    
Out[9]= {0.609, 5000000}

Read all the remaining entries:

In[10]:= BinaryReadList[f, "Real64"] // Length // Timing    
Out[10]= {7.782, 70000000}

In[11]:= Close[f]

(For comparison, Get usually reads the same data from an MX file in less than 1.5 seconds here. I am on WinXP btw.)


EDIT If you are willing to spend time on this and write some C code, another idea is to create a library function (using LibraryLink) that will memory-map the file (link for Windows) and copy it directly into an MTensor object (an MTensor is just a packed Mathematica array, as seen from the C side of LibraryLink).
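On the Mathematica side, loading and calling such a function would look roughly like this (readRealFile and mmapReader are hypothetical names for the C function and library you would write):

readRealFile = LibraryFunctionLoad["mmapReader", "readRealFile",
   {"UTF8String"}, {Real, 1}]; (* hypothetical LibraryLink function *)

data = readRealFile["test.bin"]; (* returns a packed 1D real array *)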

我命由我不由天
#3 · 2019-03-13 05:54

I think the two best approaches are either:

1) use Get on the *.mx file,

2) read in that data and save it in some binary format for which you write LibraryLink code, and then read it back via that. That, of course, has the disadvantage that you'd need to convert your MX files. But perhaps this is an option.

Generally speaking Get with MX files is pretty fast.

Are you sure this is not a swapping problem?

Edit 1: You could also write an import converter: tutorial/DevelopingAnImportConverter
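A bare-bones sketch of such a converter for raw Real64 data (the format name "RawReal64" is just an example; by default the registered function receives an input stream):

ImportExport`RegisterImport["RawReal64",
  Function[stream, {"Data" -> BinaryReadList[stream, "Real64"]}]]

Import["test.bin", {"RawReal64", "Data"}]; // Timing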
