How can I break down a large csv file into small f

2020-05-10 09:10发布

问题:

What I want to do:

What I want to do is that I have a big .csv file. I want to break down this big csv file into many small files based on the common records in BB column that alos contain 1 in the HH column, and all uncommon records that contain 0 in HH column.

As a result, all files will contain common records in BB column that contain 1 in the HH column, and all uncommon records that has no records in BB column and contain 0 in the HH column. The file name should be based on the common record of column 2 (BB). Please take a look below for the scenarion. Any suggestion idea is appreciated highly.

bigFile.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5     960     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

My expected results files woud be as follows:

53.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

59.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

61.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5    960      1
   12             68b       95     3       5    334      0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4    215      0
   12             a10       36     5       1             0      

回答1:

For the data you have provided, the following script will produce your requested output files. It will perform this operation on ALL CSV files found in the folder:

from itertools import groupby
import glob
import csv
import os

def remove_unwanted(rows):
    return [['' if col == 'NULL' else col for col in row[2:]] for row in rows]

output_folder = 'temp'  # make sure this folder exists

# Search for ALL CSV files in the current folder
for csv_filename in glob.glob('*.csv'):
    with open(csv_filename) as f_input:
        basename = os.path.splitext(os.path.basename(csv_filename))[0]      # e.g. bigfile

        csv_input = csv.reader(f_input)
        header = next(csv_input)
        # Create a list of entries with '0' in last column
        id_list = remove_unwanted(row for row in csv_input if row[7] == '0')
        f_input.seek(0)     # Go back to the start
        header = remove_unwanted([next(csv_input)])

        for k, g in groupby(csv_input, key=lambda x: x[1]):
            if k == '':
                break

            # Format an output file name in the form 'bigfile_53.csv'
            file_name = os.path.join(output_folder, '{}_{}.csv'.format(basename, k))

            with open(file_name, 'wb') as f_output:
                csv_output = csv.writer(f_output)
                csv_output.writerows(header)
                csv_output.writerows(remove_unwanted(g))
                csv_output.writerows(id_list)

This will result in the files bigfile_53.csv, bigfile_59.csv and bigfile_61.csv being created in an output folder called temp. For example bigfile_53.csv will appear as follows:

Entries containing the string 'NULL' will be converted to an empty string, and the first two columns will be removed (as per OP's comment).

Tested in Python 2.7.9



回答2:

You should look into the csv module. You can read your input file line by line and sort each line according to the BB column. This should be easy to do with a dictionary who's keys are the value in the BB column and the values are a list containing the information from that row. You can then write these lists to csv files using the csv module.