I'm trying to load a CSV file which is 250 MB: roughly 4 million rows and 6 columns of time series data (1-minute resolution). My usual procedure is:
    import pandas as pd

    location = r'C:\Users\Name\Folder_1\Folder_2\file.csv'
    df = pd.read_csv(location)
This takes about 20 minutes, which is far too long. As a very preliminary step I have explored a few alternative options.

I wonder if anybody has compared these options (or others) and whether there is a clear winner. If nobody answers, I will post my own results in the future; I just don't have time right now.
Here are the results of my read and write comparison for the DataFrame (shape: 4,000,000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).
Comparison for the following storage formats: CSV, CSV.gzip, Pickle, and HDF5 (with various compression settings).

Measured for each format: reading time, writing/saving time, and file size in relation to the uncompressed CSV file.
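If you want to reproduce the comparison, a minimal sketch along these lines should work (the file names and the timing helper are placeholders of my own, not the exact code used for the numbers below; the HDF5 calls require the PyTables package):

    import time
    import numpy as np
    import pandas as pd

    # a DataFrame of a similar shape to the one tested (use your own data instead)
    n = 4_000_000
    df = pd.DataFrame(np.random.randn(n, 6), columns=list('abcdef'))

    def timed(label, fn):
        t0 = time.perf_counter()
        fn()
        print(f'{label}: {time.perf_counter() - t0:.2f} s')

    # writing
    timed('to_csv',               lambda: df.to_csv('test.csv', index=False))
    timed('to_csv (gzip)',        lambda: df.to_csv('test.csv.gz', index=False, compression='gzip'))
    timed('to_pickle',            lambda: df.to_pickle('test.pkl'))
    timed('to_hdf (fixed)',       lambda: df.to_hdf('test_fixed.h5', key='df', mode='w'))
    timed('to_hdf (table)',       lambda: df.to_hdf('test_table.h5', key='df', mode='w', format='table'))
    timed('to_hdf (table, zlib)', lambda: df.to_hdf('test_zlib.h5', key='df', mode='w', format='table',
                                                    complib='zlib', complevel=5))

    # reading
    timed('read_csv',               lambda: pd.read_csv('test.csv'))
    timed('read_csv (gzip)',        lambda: pd.read_csv('test.csv.gz'))
    timed('read_pickle',            lambda: pd.read_pickle('test.pkl'))
    timed('read_hdf (fixed)',       lambda: pd.read_hdf('test_fixed.h5', 'df'))
    timed('read_hdf (table)',       lambda: pd.read_hdf('test_table.h5', 'df'))
    timed('read_hdf (table, zlib)', lambda: pd.read_hdf('test_zlib.h5', 'df'))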
RAW DATA, DF info and file sizes (per-format output omitted here). The configurations tested were:

CSV
CSV.gzip
Pickle
HDF (format='fixed') [default]
HDF (format='table')
HDF (format='table', complib='zlib', complevel=5)
HDF (format='table', complib='zlib', complevel=9)
HDF (format='table', complib='bzip2', complevel=5)
HDF (format='table', complib='bzip2', complevel=9)

PS: I can't test feather on my Windows notebook.
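The DF info and file-size figures can be reproduced with something like the following (using the placeholder file names from the sketch above; df.info(memory_usage='deep') is what reports the in-memory size quoted at the top):

    import os

    # on-disk size of each file, with the uncompressed CSV as the baseline
    base = os.path.getsize('test.csv')
    for f in ('test.csv', 'test.csv.gz', 'test.pkl',
              'test_fixed.h5', 'test_table.h5', 'test_zlib.h5'):
        size = os.path.getsize(f)
        print(f'{f}: {size / 2**20:8.1f} MB   ({size / base:.2f} x CSV)')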
Conclusion:
Pickle and HDF5 are much faster, but HDF5 is more convenient: you can store multiple tables/frames inside one file, you can read your data conditionally (look at the where parameter of read_hdf()), and you can store your data compressed (zlib is faster, bzip2 gives a better compression ratio), etc. A minimal sketch of this is at the end of the answer.

PS: if you can build/use feather-format, it should be even faster compared to HDF5 and Pickle.
PPS: don't use Pickle for big data frames, as you may end up with the error message SystemError: error return without exception set. It's also described here and here.
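To illustrate the HDF5 points from the conclusion (several frames in one file, compression, conditional reads via the where parameter), here is a minimal sketch; the file name, keys and date range are made up for the example, and PyTables must be installed:

    import numpy as np
    import pandas as pd

    # two example frames with a 1-minute DatetimeIndex
    idx = pd.date_range('2017-01-01', periods=100_000, freq='min')
    prices = pd.DataFrame(np.random.randn(len(idx), 6),
                          index=idx, columns=list('abcdef'))
    volumes = pd.DataFrame(np.random.randint(0, 1000, size=(len(idx), 2)),
                           index=idx, columns=['bid', 'ask'])

    # store several frames in one compressed HDF5 file, using the queryable 'table' format
    with pd.HDFStore('data.h5', mode='w', complib='zlib', complevel=5) as store:
        store.put('prices', prices, format='table', data_columns=True)
        store.put('volumes', volumes, format='table', data_columns=True)

    # read back only the rows you need instead of the whole file
    one_day = pd.read_hdf('data.h5', 'prices',
                          where="index >= '2017-01-02' & index < '2017-01-03'")
    print(one_day.shape)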