How to input large data into python pandas using looping or parallel computing?

Posted 2019-03-25 03:45

I have an 8 GB CSV file, and I am not able to run the code below because it raises a memory error.

import pandas as pd

file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, dtype=str)

I would like to split the file into 8 smaller files (sorted by id) using Python, and finally run a loop so that one output file collects the output of all 8 files (a rough sketch of this idea follows the sample data below).

Alternatively, I would like to try parallel computing. The main goal is to process the 8 GB of data in python pandas. Thank you.

My CSV file contains numerous rows, with '/' as the field separator:

id    venue           time             code    value ......
AAA   Paris      28/05/2016 09:10      PAR      45   ......
111   Budapest   14/08/2016 19:00      BUD      62   ......
AAA   Tokyo      05/11/2016 23:20      TYO      56   ......
111   LA         12/12/2016 05:55      LAX      05   ......
111   New York   08/01/2016 04:25      NYC      14   ......
AAA   Sydney     04/05/2016 21:40      SYD      2    ......
ABX   HongKong   28/03/2016 17:10      HKG      5    ......
ABX   London     25/07/2016 13:02      LON      22   ......
AAA   Dubai      01/04/2016 18:45      DXB      19   ......
...
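
For reference, a minimal sketch of the split-by-id idea above, reading the file in chunks and appending each row to one of 8 bucket files keyed by a hash of its id (this groups, rather than fully sorts, the ids); the bucket count, chunk size, and file names are illustrative:

import os
import pandas as pd

file = "./data.csv"
buckets = 8  # illustrative bucket count

for chunk in pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=100000):
    # Route every row to one of the bucket files by hashing its id
    for i, part in chunk.groupby(chunk["id"].map(hash) % buckets):
        out = "part_{}.csv".format(i)
        # Append to the bucket file, writing the header only the first time
        part.to_csv(out, sep="/", index=False, mode="a",
                    header=not os.path.exists(out))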

5 Answers
We Are One · 2019-03-25 04:06

Use the chunksize parameter to read one chunk at a time and save each chunk to disk. This will split the original file into equal parts of 100,000 rows each:

import pandas as pd

file = "./data.csv"
chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=100000)

# Write each chunk to its own numbered CSV file
for it, chunk in enumerate(chunks):
    chunk.to_csv('chunk_{}.csv'.format(it), sep="/", index=False)

If you know the number of rows in the original file, you can calculate the exact chunksize to split the file into 8 equal parts (nrows / 8), as sketched below.
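
A rough sketch of that calculation, assuming the row count is obtained with one cheap pass over the file (the file name matches the question):

import math
import pandas as pd

file = "./data.csv"

# Count the data rows in a single pass over the file
with open(file) as f:
    nrows = sum(1 for _ in f) - 1  # minus the header row

chunksize = math.ceil(nrows / 8)  # 8 roughly equal parts
chunks = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)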

迷人小祖宗 · 2019-03-25 04:06

pandas' read_csv has two arguments that you could use to do what you want:

nrows : the number of rows you want to read
skiprows : the rows (or number of rows) to skip at the start of the file

Refer to the documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
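
For example, a sketch that reads the file in two halves; the 4,000,000 row counts are illustrative, not taken from the question:

import pandas as pd

file = "./data.csv"

# First half: read only the first 4,000,000 data rows
first = pd.read_csv(file, sep="/", header=0, dtype=str, nrows=4000000)

# Second half: keep the header row but skip data rows 1..4,000,000
second = pd.read_csv(file, sep="/", header=0, dtype=str,
                     skiprows=range(1, 4000001))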

淡お忘 · 2019-03-25 04:08

If you don't need all of the columns, you may also use the usecols parameter:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

usecols : array-like or callable, default None

Return a subset of the columns. [...] 
Using this parameter results in much faster parsing time and lower memory usage.
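
A minimal sketch, assuming only the id and value columns from the question's sample are needed:

import pandas as pd

file = "./data.csv"
# Parse only the columns that are actually needed
df = pd.read_csv(file, sep="/", header=0, dtype=str, usecols=["id", "value"])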
神经病院院长 · 2019-03-25 04:21
import numpy as np
import pandas as pd
from multiprocessing import Pool

def processor(df):
    # Some work on one part of the data
    df.sort_values('id', inplace=True)
    return df

if __name__ == '__main__':
    # This assumes the DataFrame fits in memory once loaded
    df = pd.read_csv("./data.csv", sep="/", header=0, dtype=str)

    # Split the DataFrame into 8 parts and process them in parallel
    size = 8
    df_split = np.array_split(df, size)

    cores = 8
    pool = Pool(cores)
    for n, frame in enumerate(pool.imap(processor, df_split), start=1):
        frame.to_csv('{}.csv'.format(n))
    pool.close()
    pool.join()
ら.Afraid · 2019-03-25 04:22

You might also want to use the dask framework and its built-in dask.dataframe. Essentially, the CSV file is transformed into multiple pandas DataFrames, each read in only when necessary. However, not every pandas command is available within dask.
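A minimal sketch of that approach; the groupby aggregation is illustrative, and the sep and dtype mirror the question:

import dask.dataframe as dd

file = "./data.csv"
# Lazily partition the CSV into many pandas DataFrames under the hood
ddf = dd.read_csv(file, sep="/", dtype=str)

# Work is deferred until .compute() is called
counts = ddf.groupby("id").size().compute()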
