How to split a huge CSV file based on the content of the first column

Published 2019-01-09 07:31

  • I have a huge (250 MB+) CSV file to upload
  • the file format is group_id, application_id, reading, and the data could look like
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)
  • I want to divide the file based on group_id, so the output should be n files, one per distinct group_id

Output

File 1

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9

and

File2
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....

and

File n
n, x, 0.3 (let's say)

How can I do this effectively?

7 Answers
唯我独甜
#2 · 2019-01-09 07:46

If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it yields iterators, you won't have to load the whole file into memory; you can still write each file line by line. Use the csv module to parse the file (in case you didn't already know about it).

Explosion°爆炸
#3 · 2019-01-09 07:52

If the file is already sorted by group_id, you can do something like:

import csv
from itertools import groupby

with open("foo.csv", newline="") as f:
    # groupby yields one (key, rows) pair per run of equal group_ids,
    # so this only works correctly on input sorted by the first column
    for key, rows in groupby(csv.reader(f), key=lambda row: row[0]):
        with open("%s.txt" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")
女痞
#4 · 2019-01-09 07:59

awk can do this:

 awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv

Note that awk keeps each output file open after the first write, so if there are very many distinct group_ids you may hit the OS open-file limit and need to call close() on files you are done with.
成全新的幸福
#5 · 2019-01-09 08:04

If the rows are sorted by group_id, you can use the csv module to iterate over the rows in the file and write each group out as you go. You can find information about the module in the Python documentation.

三岁会撩人
#6 · 2019-01-09 08:09

How about:

  • Read the input file a line at a time
  • split() each line on , to get the group_id
  • For each new group_id you encounter, open a new output file
    • keep the open files in a dict keyed by group_id, so each file is opened only once
  • write the line to the file for its group_id
  • Done!
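The steps above can be sketched in Python; the function name and the group_<id>.csv output naming here are illustrative, not from the thread:

import csv

def split_by_group(in_path, prefix="group_"):
    """Split a CSV into one file per distinct value in the first column.

    Keeps one open handle per group so the input is read only once,
    and works even when the rows are not sorted by group_id.
    """
    handles = {}  # group_id -> open output file
    try:
        with open(in_path, newline="") as src:
            for row in csv.reader(src):
                group_id = row[0].strip()
                if group_id not in handles:
                    # first time we see this group: open its output file
                    handles[group_id] = open(prefix + group_id + ".csv",
                                             "w", newline="")
                csv.writer(handles[group_id]).writerow(row)
    finally:
        for f in handles.values():
            f.close()

Like the awk answer, this holds one file handle per group, so with thousands of distinct group_ids you could hit the OS open-file limit; in that case, close and reopen files in append mode as needed.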
祖国的老花朵
#7 · 2019-01-09 08:11

Here's some food for thought for you:

import csv
from collections import namedtuple

# pair each open file with its csv writer so both can be looked up together
csvfile = namedtuple('csvfile', ('file', 'writer'))

class CSVFileCollections(object):

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix
        self.files = {}

    def __getitem__(self, item):
        if item not in self.files:
            # open lazily: one output file per distinct key, created on first use
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'",
                                quoting=csv.QUOTE_MINIMAL)
            self.files[item] = csvfile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for entry in self.files.values():
            entry.file.close()


with open('huge.csv', newline='') as readFile, \
        CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=",", quotechar="'")
    for row in reader:
        writer = output[row[0]]
        writer.writerow(row)