How to input large data into python pandas using looping or parallel computing?

Posted 2019-03-25 03:45

I have an 8 GB CSV file, and I am not able to run the code below because it raises a memory error.

import pandas as pd

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, dtype=str)

I would like to split the file into 8 smaller files (sorted by id) using Python, and finally run a loop so that one output file collects the output of all 8 files (a rough sketch of this idea follows the sample data below).

Alternatively, I would like to try parallel computing. The main goal is to process the 8 GB of data in python pandas. Thank you.

My CSV file contains numerous rows, with '/' as the field separator:

id    venue           time             code    value ......
AAA   Paris      28/05/2016 09:10      PAR      45   ......
111   Budapest   14/08/2016 19:00      BUD      62   ......
AAA   Tokyo      05/11/2016 23:20      TYO      56   ......
111   LA         12/12/2016 05:55      LAX      05   ......
111   New York   08/01/2016 04:25      NYC      14   ......
AAA   Sydney     04/05/2016 21:40      SYD      2    ......
ABX   HongKong   28/03/2016 17:10      HKG      5    ......
ABX   London     25/07/2016 13:02      LON      22   ......
AAA   Dubai      01/04/2016 18:45      DXB      19   ......
...
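
For reference, a minimal sketch of the split-by-id idea above, reading the file in chunks and appending each row to one of 8 bucket files keyed by a hash of its id (this groups, rather than fully sorts, the ids); the bucket count, chunk size, and file names are illustrative:

import os
import pandas as pd

file = "./data.csv"
buckets = 8  # illustrative bucket count

for chunk in pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=100000):
    # Route every row to one of the bucket files by hashing its id
    for i, part in chunk.groupby(chunk["id"].map(hash) % buckets):
        out = "part_{}.csv".format(i)
        # Append to the bucket file, writing the header only the first time
        part.to_csv(out, sep="/", index=False, mode="a",
                    header=not os.path.exists(out))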

5 Answers
We Are One · 2019-03-25 04:06

Use the chunksize parameter to read one chunk at a time and save each chunk to disk. This will split the original file into equal parts of 100,000 rows each:

import pandas as pd

file = "./data.csv"
chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=100000)

# Write each chunk to its own numbered CSV file
for it, chunk in enumerate(chunks):
    chunk.to_csv('chunk_{}.csv'.format(it), sep="/", index=False)

If you know the number of rows in the original file, you can calculate the exact chunksize to split the file into 8 equal parts (nrows / 8), as sketched below.
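
A rough sketch of that calculation, assuming the row count is obtained with one cheap pass over the file (the file name matches the question):

import math
import pandas as pd

file = "./data.csv"

# Count the data rows in a single pass over the file
with open(file) as f:
    nrows = sum(1 for _ in f) - 1  # minus the header row

chunksize = math.ceil(nrows / 8)  # 8 roughly equal parts
chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)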

迷人小祖宗 · 2019-03-25 04:06

pandas' read_csv has two arguments that you could use to do what you want:

nrows : the number of rows you want to read
skiprows : the rows (or number of rows) to skip at the start of the file

Refer to the documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
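
For example, a sketch that reads the file in two halves; the 4,000,000 row counts are illustrative, not taken from the question:

import pandas as pd

file = "./data.csv"

# First half: read only the first 4,000,000 data rows
first = pd.read_csv(file, sep="/", header=0, dtype=str, nrows=4000000)

# Second half: keep the header row but skip data rows 1..4,000,000
second = pd.read_csv(file, sep="/", header=0, dtype=str,
                     skiprows=range(1, 4000001))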

淡お忘 · 2019-03-25 04:08

If you don't need all of the columns, you may also use the usecols parameter:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

usecols : array-like or callable, default None

Return a subset of the columns. [...] 
Using this parameter results in much faster parsing time and lower memory usage.
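
A minimal sketch, assuming only the id and value columns from the question's sample are needed:

import pandas as pd

file = "./data.csv"
# Parse only the columns that are actually needed
df = pd.read_csv(file, sep="/", header=0, dtype=str, usecols=["id", "value"])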
神经病院院长 · 2019-03-25 04:21
import numpy as np
import pandas as pd
from multiprocessing import Pool

def processor(df):
    # Some work on one part of the data
    df.sort_values('id', inplace=True)
    return df

if __name__ == '__main__':
    # This assumes the DataFrame fits in memory once loaded
    df = pd.read_csv("./data.csv", sep="/", header=0, dtype=str)

    # Split the DataFrame into 8 parts and process them in parallel
    size = 8
    df_split = np.array_split(df, size)

    cores = 8
    pool = Pool(cores)
    for n, frame in enumerate(pool.imap(processor, df_split), start=1):
        frame.to_csv('{}.csv'.format(n))
    pool.close()
    pool.join()
ら.Afraid · 2019-03-25 04:22

You might also want to use the dask framework and its built-in dask.dataframe. Essentially, the CSV file is transformed into multiple pandas DataFrames, each read in only when necessary. However, not every pandas command is available within dask.
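A minimal sketch of that approach; the groupby aggregation is illustrative, and the sep and dtype mirror the question:

import dask.dataframe as dd

file = "./data.csv"
# Lazily partition the CSV into many pandas DataFrames under the hood
ddf = dd.read_csv(file, sep="/", dtype=str)

# Work is deferred until .compute() is called
counts = ddf.groupby("id").size().compute()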
