I have a csv file with two columns, holding measurements from an oscilloscope:
Model,MSO4034
Firmware Version,2.48
# ... (15 lines of header) ...
-5.0000000e-02,-0.0088
-4.9999990e-02,0.0116
-4.9999980e-02,0.006
-4.9999970e-02,-0.0028
-4.9999960e-02,-0.002
-4.9999950e-02,-0.0028
-4.9999940e-02,0.0092
-4.9999930e-02,-0.0072
-4.9999920e-02,-0.0008
-4.9999910e-02,-0.0056
This data I'd like to load into a numpy array. I could use np.loadtxt:
np.loadtxt('data.csv', delimiter=',', skiprows=15, usecols=[1])
However, my data file is huge (100 MSamples), which would take numpy over half an hour to load and parse (21.5 ms per 1000 lines).
My preferred approach would be to directly create a memory-map file for numpy, which just consists of the binary values concatenated into a single file. It is essentially the array as it would sit in memory, except that it lives on disk rather than in RAM.
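For reference, reading such a raw binary file back in numpy would look roughly like this (a sketch; data.bin and the float32 dtype are placeholders for whatever the file actually contains):

import numpy as np

# Open the raw binary file as an array backed by disk; elements are only
# paged into RAM when they are actually accessed.
samples = np.memmap('data.bin', dtype=np.float32, mode='r')
print(samples[:10])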
Question
Is there any convenient way of doing this? Using Linux, I could tail away the header and cut out the second column, but I'd still need to parse the values' string representation before writing them into a binary file on disk:
$ tail -n +16 data.csv | cut -d',' -f2
-0.0088
0.0116
0.006
-0.0028
-0.002
-0.0028
0.0092
-0.0072
-0.0008
-0.0056
Is there any Linux command for parsing the string representation of floats and writing them to disk in binary form?
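For illustration, the closest thing I can think of is piping the cut output into a short Python helper that packs the floats into a raw binary file (a sketch; data.bin is just a placeholder name, and this still goes through Python rather than a pure shell tool):

$ tail -n +16 data.csv | cut -d',' -f2 | \
    python3 -c "import sys, numpy as np; np.fromiter(map(float, sys.stdin), dtype=np.float32).tofile('data.bin')"

This converts one line at a time, but it still builds the full float32 array in memory (about 400 MB for 100 MSamples) before writing it out; the resulting data.bin could then be opened with np.memmap.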
I'd also recommend using Pandas' CSV parser, but instead of reading the whole file into memory in one go, I would iterate over it in chunks and write these to a memory-mapped array on the fly:
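Something along these lines (a sketch; the file names, the float32 dtype and the 15 header lines are assumptions you'd adapt to your file):

import numpy as np
import pandas as pd

filename = 'data.csv'      # source file (placeholder)
out_filename = 'data.bin'  # raw binary file backing the memmap (placeholder)
n_header = 15
chunksize = 10 ** 6        # rows per chunk; tune to available RAM

# Cheap first pass: count the data rows so the memmap can be sized up front.
with open(filename) as f:
    n_rows = sum(1 for _ in f) - n_header

mm = np.memmap(out_filename, dtype=np.float32, mode='w+', shape=(n_rows,))

offset = 0
for chunk in pd.read_csv(filename, delimiter=',', skiprows=n_header,
                         usecols=[1], header=None, chunksize=chunksize):
    values = chunk.to_numpy(dtype=np.float32).ravel()
    mm[offset:offset + len(values)] = values
    offset += len(values)

mm.flush()  # make sure everything has been written out to disk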
You can adjust the chunk size according to how much of the file you can fit in memory at a time. Bear in mind that you'll need to hold the raw text as well as the converted values in memory while you parse a chunk.
Since your data are on disk, you have to import it first, and that will be costly.
I think the best CSV reader today is the pandas one, which seems about 10 times faster than your test (but that's disk dependent).
After that, you can use tools like pickle to save in binary mode and save time.
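For example (a sketch; this parses the whole column into RAM in one go, and the file names are placeholders):

import numpy as np
import pandas as pd

# Parse only the voltage column; skip the 15 header lines.
df = pd.read_csv('data.csv', delimiter=',', skiprows=15, usecols=[1], header=None)
values = df.iloc[:, 0].to_numpy(dtype=np.float32)

# Save once in a binary format; later loads are nearly free.
np.save('data.npy', values)                   # or df.to_pickle('data.pkl')
values = np.load('data.npy', mmap_mode='r')   # can even be memory-mapped on reload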