Split really large file into smaller files in Python

Posted 2019-05-28 16:07

Question:

I have a really large csv file (close to a Terabyte) that I want to split into smaller csv files, based on info in each row.

Since there is no way to do that in memory, my intended approach was to read each line, decide which file it should go into, and append it there. This, however, takes ages, since opening and closing a file for every row is far too slow.

My second approach was to keep all files (about 3000) open; this, however, does not work, since I can't have that many files open in parallel.

Additional details as requested: The .csv file contains map data I need to access region-wise. Therefore, I plan on clustering it into files covering different bounding boxes. Since it is unsorted data, I have to process the lat/lon of each row, assign the correct file to it, and append the row to the file.
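
To make that concrete, a minimal sketch of the row-by-row version (the coordinate column positions and the 1-degree grid are just placeholders):

import csv

CELL_SIZE = 1.0  # degrees per bounding box; placeholder value

def cell_file(lat, lon):
    # map a coordinate to the file covering its bounding box
    return 'cell_{}_{}.csv'.format(int(lat // CELL_SIZE), int(lon // CELL_SIZE))

with open('map_data.csv', newline='') as src:
    reader = csv.reader(src)
    next(reader)  # skip the header row
    for row in reader:
        lat, lon = float(row[0]), float(row[1])  # assumed column positions
        # opening and closing the target file for every single row is what makes this slow
        with open(cell_file(lat, lon), 'a', newline='') as out:
            csv.writer(out).writerow(row)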

What would be a working (fast, ideally) approach for that?

Answer 1:

This may be somewhat of a hacky method, but it only requires pandas and some batched appends. This solves the issue of having to open and close a file while processing every row. I'm going to assume that the way you triage the rows into your CSVs is based on some value from a column in your large CSV.

import pandas as pd
import os

df_chunked = pd.read_csv("myLarge.csv", chunksize=30000)  # you can alter the chunksize

for chunk in df_chunked:
    uniques = chunk['col'].unique().tolist()
    for val in uniques:
        df_to_write = chunk[chunk['col'] == val]
        out_file = 'small_{}.csv'.format(val)
        if os.path.isfile(out_file):  # append without the header if the file already exists
            df_to_write.to_csv(out_file, mode='a', index=False, header=False)
        else:
            df_to_write.to_csv(out_file, index=False)
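
If the triage value has to be derived from coordinates rather than read from an existing column, as in the question, the same chunked pattern can be used after first computing a grid-cell key for each row. A rough sketch, where the 'lat'/'lon' column names, the cell size, and the file naming are assumptions:

import pandas as pd
import os

CELL = 1.0  # bounding-box size in degrees; placeholder value

for chunk in pd.read_csv("myLarge.csv", chunksize=30000):
    # derive one triage key per row from the coordinates (column names assumed)
    chunk['cell'] = ((chunk['lat'] // CELL).astype(int).astype(str) + '_'
                     + (chunk['lon'] // CELL).astype(int).astype(str))
    for val, group in chunk.groupby('cell'):
        out_file = 'small_{}.csv'.format(val)
        group.drop(columns='cell').to_csv(out_file, mode='a', index=False,
                                          header=not os.path.isfile(out_file))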


Answer 2:

Although I agree with the comments that knowing the details of the file would be essential to providing an actual working answer, I had a similar issue and resolved it using pandas.

The .read_csv method of pandas uses the csv reader to read the file only partially, without loading the whole file into memory. It is best to do some trial and error to see how many rows at a time your system can support.

import pandas as pd

num_rows = 6000  # or some value which depends on system memory
NewDf1 = pd.DataFrame()
NewDf2 = pd.DataFrame()

# start and stop bound the data rows to process, e.g. 0 and the total row count
for i in range(start, stop, num_rows):
    # skip the rows already read (keeping the header row) and load the next batch
    rawdata_df = pd.read_csv(filename, skiprows=range(1, i), nrows=num_rows)
    # append returns a new dataframe, so the result has to be assigned back
    NewDf1 = NewDf1.append(rawdata_df[rawdata_df.iloc[:, 1] > 0], ignore_index=True)
    NewDf2 = NewDf2.append(rawdata_df[rawdata_df.iloc[:, 2] > 0], ignore_index=True)

At the end, depending on your actual system memory and the structure of the files (because the size of a dataframe also depends on the number of columns and the structure of the data), you can save the separate dataframes in separate CSVs:

if len(NewDf1) > toolarge:
    NewDf1.to_csv('newdf1Filename.csv')
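
One detail worth adding as a sketch (not part of the original recipe): after flushing, appending to the CSV and resetting the in-memory dataframe keeps memory bounded; os.path.isfile is only used here to decide whether the header still needs to be written.

import os

if len(NewDf1) > toolarge:
    out_file = 'newdf1Filename.csv'
    # append to disk, write the header only on the first flush, then clear the frame
    NewDf1.to_csv(out_file, mode='a', index=False, header=not os.path.isfile(out_file))
    NewDf1 = pd.DataFrame()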