I have an 8 GB CSV file and I am not able to run my code because it raises a memory error:
file = "./data.csv"
df = pd.read_csv(file, sep="/", header=0, dtype=str)
I would like to split the file into 8 smaller files (sorted by id) using Python, and finally have a loop so that the output file contains the output of all 8 files.
Alternatively, I would like to try parallel computing. The main goal is to process the 8 GB of data in Python with pandas. Thank you.
My CSV file contains numerous rows of data, with '/' as the field separator:
id   venue     time              code  value  ......
AAA  Paris     28/05/2016 09:10  PAR   45     ......
111  Budapest  14/08/2016 19:00  BUD   62     ......
AAA  Tokyo     05/11/2016 23:20  TYO   56     ......
111  LA        12/12/2016 05:55  LAX   05     ......
111  New York  08/01/2016 04:25  NYC   14     ......
AAA  Sydney    04/05/2016 21:40  SYD   2      ......
ABX  HongKong  28/03/2016 17:10  HKG   5      ......
ABX  London    25/07/2016 13:02  LON   22     ......
AAA  Dubai     01/04/2016 18:45  DXB   19     ......
...
pandas read_csv has two argument options that you could use to do what you want. Use the chunksize parameter to read one chunk at a time and save each chunk to disk; this will split the original file into equal parts of 100,000 rows each. If you know the number of rows of the original file, you can calculate the exact chunksize needed to split it into 8 equal parts (nrows / 8).
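A minimal sketch of that split (the part_*.csv output names are made up for illustration, and 100000 stands in for nrows / 8):

import pandas as pd

file = "./data.csv"
chunksize = 100000  # replace with nrows / 8 if you know nrows

# read one chunk at a time and write each one to its own file
reader = pd.read_csv(file, sep="/", header=0, dtype=str, chunksize=chunksize)
for i, chunk in enumerate(reader):
    chunk.to_csv(f"part_{i}.csv", sep="/", index=False)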
Refer to the documentation at: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
If you don't need all the columns, you may also use the usecols parameter to read only a subset of them (documented on the same read_csv page).
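For example, a sketch that keeps only two of the columns (the names id and value come from the sample data above):

import pandas as pd

# load only the columns you actually need, which cuts memory use
df = pd.read_csv("./data.csv", sep="/", header=0, dtype=str,
                 usecols=["id", "value"])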
You might also want to use the Dask framework and its built-in dask.dataframe. Essentially, the CSV file is transformed into multiple pandas DataFrames, each read in only when necessary. However, not every pandas command is available within Dask.
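A minimal sketch of the Dask approach (the groupby on id is just an illustration of a per-id aggregation, not part of the original question):

import dask.dataframe as dd

# the csv is read lazily as a collection of pandas DataFrames
ddf = dd.read_csv("./data.csv", sep="/", dtype=str)

# operations build a task graph; compute() runs it and returns a pandas object
counts = ddf.groupby("id").size().compute()
print(counts)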