Split really large file into smaller files in Python

Posted 2019-05-28 16:07

Question:

I have a really large csv file (close to a Terabyte) that I want to split into smaller csv files, based on info in each row.

Since there is no way to do that in memory, my intended approach was to read each line, decide which file it should go into, and append it there. This, however, takes ages, since opening and closing a file for every row is far too slow.

My second approach was to keep all files (about 3000) open; this, however, does not work, since I can't have that many files open in parallel.

Additional details as requested: The .csv file contains map data I need to access region-wise. Therefore, I plan on clustering it into files covering different bounding boxes. Since it is unsorted data, I have to process the lat/lon of each row, assign the correct file to it, and append the row to the file.
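
To make that concrete, a minimal sketch of the row-by-row version (the coordinate column positions and the 1-degree grid are just placeholders):

import csv

CELL_SIZE = 1.0  # degrees per bounding box; placeholder value

def cell_file(lat, lon):
    # map a coordinate to the file covering its bounding box
    return 'cell_{}_{}.csv'.format(int(lat // CELL_SIZE), int(lon // CELL_SIZE))

with open('map_data.csv', newline='') as src:
    reader = csv.reader(src)
    next(reader)  # skip the header row
    for row in reader:
        lat, lon = float(row[0]), float(row[1])  # assumed column positions
        # opening and closing the target file for every single row is what makes this slow
        with open(cell_file(lat, lon), 'a', newline='') as out:
            csv.writer(out).writerow(row)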

What would be a working (fast, ideally) approach for that?

Answer 1:

This may be somewhat of a hacky method, but it only requires pandas and some batched appends. This solves the issue of having to open and close a file while processing every row. I'm going to assume that the way you triage the rows into your CSVs is based on some value from a column in your large CSV.

import pandas as pd
import os

df_chunked = pd.read_csv("myLarge.csv", chunksize=30000)  # you can alter the chunksize

for chunk in df_chunked:
    uniques = chunk['col'].unique().tolist()
    for val in uniques:
        df_to_write = chunk[chunk['col'] == val]
        out_file = 'small_{}.csv'.format(val)
        if os.path.isfile(out_file):  # append without the header if the file already exists
            df_to_write.to_csv(out_file, mode='a', index=False, header=False)
        else:
            df_to_write.to_csv(out_file, index=False)
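
If the triage value has to be derived from coordinates rather than read from an existing column, as in the question, the same chunked pattern can be used after first computing a grid-cell key for each row. A rough sketch, where the 'lat'/'lon' column names, the cell size, and the file naming are assumptions:

import pandas as pd
import os

CELL = 1.0  # bounding-box size in degrees; placeholder value

for chunk in pd.read_csv("myLarge.csv", chunksize=30000):
    # derive one triage key per row from the coordinates (column names assumed)
    chunk['cell'] = ((chunk['lat'] // CELL).astype(int).astype(str) + '_'
                     + (chunk['lon'] // CELL).astype(int).astype(str))
    for val, group in chunk.groupby('cell'):
        out_file = 'small_{}.csv'.format(val)
        group.drop(columns='cell').to_csv(out_file, mode='a', index=False,
                                          header=not os.path.isfile(out_file))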


Answer 2:

Although I agree with the comments that knowing the details of the file would be essential to providing an actual working answer, I had a similar issue and resolved it using pandas.

The .read_csv method of pandas uses the csv reader to read the file only partially, without loading the whole file into memory. It is best to do some trial and error to see how many rows at a time your system can support.

import pandas as pd

num_rows = 6000  # or some value which depends on system memory
NewDf1 = pd.DataFrame()
NewDf2 = pd.DataFrame()

# start and stop bound the data rows to process, e.g. 0 and the total row count
for i in range(start, stop, num_rows):
    # skip the rows already read (keeping the header row) and load the next batch
    rawdata_df = pd.read_csv(filename, skiprows=range(1, i), nrows=num_rows)
    # append returns a new dataframe, so the result has to be assigned back
    NewDf1 = NewDf1.append(rawdata_df[rawdata_df.iloc[:, 1] > 0], ignore_index=True)
    NewDf2 = NewDf2.append(rawdata_df[rawdata_df.iloc[:, 2] > 0], ignore_index=True)

At the end, depending on your actual system memory and the structure of the files (because the size of a dataframe also depends on the number of columns and the structure of the data), you can save the separate dataframes in separate CSVs:

if len(NewDf1) > toolarge:
    NewDf1.to_csv('newdf1Filename.csv')
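
One detail worth adding as a sketch (not part of the original recipe): after flushing, appending to the CSV and resetting the in-memory dataframe keeps memory bounded; os.path.isfile is only used here to decide whether the header still needs to be written.

import os

if len(NewDf1) > toolarge:
    out_file = 'newdf1Filename.csv'
    # append to disk, write the header only on the first flush, then clear the frame
    NewDf1.to_csv(out_file, mode='a', index=False, header=not os.path.isfile(out_file))
    NewDf1 = pd.DataFrame()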