I have a csv file with two columns, holding measurements from an oscilloscope:
Model,MSO4034
Firmware Version,2.48
# ... (15 lines of header) ...
-5.0000000e-02,-0.0088
-4.9999990e-02,0.0116
-4.9999980e-02,0.006
-4.9999970e-02,-0.0028
-4.9999960e-02,-0.002
-4.9999950e-02,-0.0028
-4.9999940e-02,0.0092
-4.9999930e-02,-0.0072
-4.9999920e-02,-0.0008
-4.9999910e-02,-0.0056
This data I'd like to load into a numpy array. I could use np.loadtxt:
np.loadtxt('data.csv', delimiter=',', skiprows=15, usecols=[1])
However, my data file is huge (100 MSamples), which would take numpy over half an hour to load and parse (21.5 ms per 1000 lines).
My preferred approach would be to directly create a memory-map file for numpy, which just consists of the binary values concatenated into a single file. It is essentially the array as it would sit in memory, except that it lives on disk rather than in RAM.
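For reference, reading such a raw binary file back in numpy would look roughly like this (a sketch; data.bin and the float32 dtype are placeholders for whatever the file actually contains):

import numpy as np

# Open the raw binary file as an array backed by disk; elements are only
# paged into RAM when they are actually accessed.
samples = np.memmap('data.bin', dtype=np.float32, mode='r')
print(samples[:10])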
Question
Is there any convenient way of doing this? Using Linux, I could tail away the header and cut out the second column, but I'd still need to parse the values' string representation before writing them into a binary file on disk:
$ tail -n +16 data.csv | cut -d',' -f2
-0.0088
0.0116
0.006
-0.0028
-0.002
-0.0028
0.0092
-0.0072
-0.0008
-0.0056
Is there any Linux command for parsing the string representation of floats and writing them to disk in binary form?
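For illustration, the closest thing I can think of is piping the cut output into a short Python helper that packs the floats into a raw binary file (a sketch; data.bin is just a placeholder name, and this still goes through Python rather than a pure shell tool):

$ tail -n +16 data.csv | cut -d',' -f2 | \
    python3 -c "import sys, numpy as np; np.fromiter(map(float, sys.stdin), dtype=np.float32).tofile('data.bin')"

This converts one line at a time, but it still builds the full float32 array in memory (about 400 MB for 100 MSamples) before writing it out; the resulting data.bin could then be opened with np.memmap.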
I'd also recommend using Pandas' CSV parser, but instead of reading the whole file into memory in one go, I would iterate over it in chunks and write these to a memory-mapped array on the fly:
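Something along these lines (a sketch; the file names, the float32 dtype and the 15 header lines are assumptions you'd adapt to your file):

import numpy as np
import pandas as pd

filename = 'data.csv'      # source file (placeholder)
out_filename = 'data.bin'  # raw binary file backing the memmap (placeholder)
n_header = 15
chunksize = 10 ** 6        # rows per chunk; tune to available RAM

# Cheap first pass: count the data rows so the memmap can be sized up front.
with open(filename) as f:
    n_rows = sum(1 for _ in f) - n_header

mm = np.memmap(out_filename, dtype=np.float32, mode='w+', shape=(n_rows,))

offset = 0
for chunk in pd.read_csv(filename, delimiter=',', skiprows=n_header,
                         usecols=[1], header=None, chunksize=chunksize):
    values = chunk.to_numpy(dtype=np.float32).ravel()
    mm[offset:offset + len(values)] = values
    offset += len(values)

mm.flush()  # make sure everything has been written out to disk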
You can adjust the chunk size according to how much of the file you can fit in memory at a time. Bear in mind that you'll need to hold the raw text as well as the converted values in memory while you parse a chunk.
Since your data are on disk, you have to import it first, and that will be costly.
I think the best CSV reader today is the pandas one, which seems about 10 times faster than your test (but that's disk dependent).
After that, you can use tools like pickle to save in binary mode and save time.
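For example (a sketch; this parses the whole column into RAM in one go, and the file names are placeholders):

import numpy as np
import pandas as pd

# Parse only the voltage column; skip the 15 header lines.
df = pd.read_csv('data.csv', delimiter=',', skiprows=15, usecols=[1], header=None)
values = df.iloc[:, 0].to_numpy(dtype=np.float32)

# Save once in a binary format; later loads are nearly free.
np.save('data.npy', values)                   # or df.to_pickle('data.pkl')
values = np.load('data.npy', mmap_mode='r')   # can even be memory-mapped on reload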