Loading a CSV column into a numpy memmap (fast)

Posted 2019-09-03 18:26

Question:

I have a csv file with two columns, holding measurements from an oscilloscope:

Model,MSO4034
Firmware Version,2.48
# ... (15 lines of header) ...
-5.0000000e-02,-0.0088
-4.9999990e-02,0.0116
-4.9999980e-02,0.006
-4.9999970e-02,-0.0028
-4.9999960e-02,-0.002
-4.9999950e-02,-0.0028
-4.9999940e-02,0.0092
-4.9999930e-02,-0.0072
-4.9999920e-02,-0.0008
-4.9999910e-02,-0.0056

I'd like to load this data into a numpy array. I could use np.loadtxt:

np.loadtxt('data.csv', delimiter=',', skiprows=15, usecols=[1])

However, my data file is huge (100 MSamples), and at 21.5 ms per 1000 lines numpy would take over half an hour to load and parse it.

My preferred approach would be to directly create a memory-mapped file for numpy, consisting of nothing but the binary values concatenated into a single file. It is essentially the in-memory array, except that it lives on disk rather than in RAM.
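To make the target format concrete, here is a minimal sketch of such a raw file using np.memmap. The sample values and the file name samples.f64 are placeholders, not part of the real data set.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample values standing in for the parsed second column.
values = np.array([-0.0088, 0.0116, 0.006, -0.0028], dtype=np.float64)

path = os.path.join(tempfile.mkdtemp(), "samples.f64")  # hypothetical file name

# Write: the file holds nothing but the 8-byte doubles, back to back.
out = np.memmap(path, dtype=np.float64, mode="w+", shape=values.shape)
out[:] = values
out.flush()
del out

# Read: the dtype (and the shape, if not 1-D) must be supplied again,
# because a raw memmap file carries no header describing its contents.
samples = np.memmap(path, dtype=np.float64, mode="r")
file_size = os.path.getsize(path)
```

With four float64 values the file is 4 x 8 bytes, and the array is only paged in from disk when it is actually indexed.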


Question

Is there any convenient way of doing this? On Linux, I could strip the header with tail and extract the second column with cut, but I'd still need to parse the string representation of the values before writing them to a binary file on disk:

$ tail -n +16 data.csv | cut -d',' -f2
-0.0088
0.0116
0.006
-0.0028
-0.002
-0.0028
0.0092
-0.0072
-0.0008
-0.0056

Is there a Linux command that parses the string representation of floats and writes them to disk in binary form?

Answer 1:

I'd also recommend using Pandas' CSV parser, but instead of reading the whole file into memory in one go I would iterate over it in chunks and write these to a memory-mapped array on the fly:

import numpy as np
from numpy.lib.format import open_memmap
import pandas as pd

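# make some test data (savetxt writes the header line as '# foo,bar',
# which read_csv then consumes as the column names)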
data = np.random.randn(100000, 2)
np.savetxt('/tmp/data.csv', data, delimiter=',', header='foo,bar')

# we need to specify the shape and dtype in advance, but it is cheap to
# allocate an array with more rows than required on filesystems that
# support sparse files, since untouched regions take up no disk space.
mmap = open_memmap('/tmp/arr.npy', mode='w+', dtype=np.double, shape=(100000, 2))

# parse at most 10000 rows at a time and write them to the memory-mapped array
n = 0
for chunk in pd.read_csv('/tmp/data.csv', chunksize=10000):
    mmap[n:n+chunk.shape[0]] = chunk.values
    n += chunk.shape[0]

print(np.allclose(data, mmap))
# True

You can adjust the chunk size according to how much of the file you can fit in memory at a time. Bear in mind that you'll need to hold the raw text as well as the converted values in memory while you parse a chunk.
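For the oscilloscope file in the question, the same idea might look like the sketch below. It assumes 15 header lines (from the question's skiprows=15), no blank lines in the file, and that only the second column is wanted. The stand-in file and the names csv_path and npy_path are made up for illustration. Since the row count is not known up front, a cheap first pass counts the lines so the memmap can be sized exactly.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from numpy.lib.format import open_memmap

HEADER_LINES = 15  # taken from the question's skiprows=15

# Build a small stand-in file: 15 header lines, then time,voltage rows.
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "data.csv")      # hypothetical path
npy_path = os.path.join(tmpdir, "voltage.npy")   # hypothetical path
rng = np.random.default_rng(0)
volts = rng.normal(scale=0.01, size=2500)
times = -0.05 + 1e-8 * np.arange(volts.size)
with open(csv_path, "w") as f:
    for i in range(HEADER_LINES):
        f.write(f"Header {i},value\n")
    for t, v in zip(times, volts):
        f.write(f"{t:.7e},{v:.17g}\n")

# First pass: count data rows (assumes no blank lines in the file).
with open(csv_path, "rb") as f:
    n_rows = sum(1 for _ in f) - HEADER_LINES

mm = open_memmap(npy_path, mode="w+", dtype=np.float64, shape=(n_rows,))

# Second pass: parse only the second column, one chunk at a time.
reader = pd.read_csv(csv_path, skiprows=HEADER_LINES, header=None,
                     usecols=[1], dtype=np.float64, chunksize=1000)
n = 0
for chunk in reader:
    col = chunk.iloc[:, 0].to_numpy()
    mm[n:n + col.size] = col
    n += col.size
mm.flush()
```

Counting lines reads the file twice, but that pass does no float parsing and is usually much cheaper than the parse itself. Over-allocating and trimming afterwards would be an alternative.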



Answer 2:

Since your data is stored as text on disk, you have to parse it at least once, and that will be costly.

I think the best CSV reader available today is the one in pandas.

In [7]: %timeit v=pd.read_csv('100ksamples.csv',sep=',')
1 loop, best of 3: 276 ms per loop # for 100k lines

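That is roughly 8 times faster than your measurement (2.76 ms vs. 21.5 ms per 1000 lines), although the result depends on the disk.

After that, you can use a tool like pickle to store the parsed data in binary form, which makes subsequent loads much faster.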

In [8]: %timeit with open('e.pk','bw') as f : pickle.dump(v,f)
100 loops, best of 3: 16.2 ms per loop

In [9]: %timeit with open('e.pk','br') as f : v2=pickle.load(f)
100 loops, best of 3: 8.64 ms per loop