How to split a huge CSV file based on the content of the first column

Published 2019-01-09 07:31

  • I have a huge (250 MB+) CSV file to upload
  • the file format is group_id, application_id, reading, and the data could look like
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)
  • I want to divide the file based on group_id, so the output should be n files, one per distinct group_id

Output

File 1

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9

and

File2
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....

and

File n
n, x, 0.3 (let's say)

How can I do this effectively?

7 Answers
唯我独甜
#2 · 2019-01-09 07:46

If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it yields iterators, you won't have to load the whole file into memory; you can still write each file line by line. Use the csv module to parse the file (in case you didn't already know about it).

Explosion°爆炸
#3 · 2019-01-09 07:52

If the file is already sorted by group_id, you can do something like:

import csv
from itertools import groupby

with open("foo.csv", newline="") as f:
    # groupby yields one (key, rows) pair per run of equal group_ids,
    # so this only works correctly on input sorted by the first column
    for key, rows in groupby(csv.reader(f), key=lambda row: row[0]):
        with open("%s.txt" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")
女痞
#4 · 2019-01-09 07:59

awk can do this:

 awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv

Note that awk keeps each output file open after the first write, so if there are very many distinct group_ids you may hit the OS open-file limit and need to call close() on files you are done with.
成全新的幸福
#5 · 2019-01-09 08:04

If the rows are sorted by group_id, you can use the csv module to iterate over the rows in the file and write each group out as you go. You can find information about the module in the Python documentation.

三岁会撩人
#6 · 2019-01-09 08:09

How about:

  • Read the input file a line at a time
  • split() each line on , to get the group_id
  • For each new group_id you encounter, open a new output file
    • keep the open files in a dict keyed by group_id, so each file is opened only once
  • write the line to the file for its group_id
  • Done!
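The steps above can be sketched in Python; the function name and the group_<id>.csv output naming here are illustrative, not from the thread:

import csv

def split_by_group(in_path, prefix="group_"):
    """Split a CSV into one file per distinct value in the first column.

    Keeps one open handle per group so the input is read only once,
    and works even when the rows are not sorted by group_id.
    """
    handles = {}  # group_id -> open output file
    try:
        with open(in_path, newline="") as src:
            for row in csv.reader(src):
                group_id = row[0].strip()
                if group_id not in handles:
                    # first time we see this group: open its output file
                    handles[group_id] = open(prefix + group_id + ".csv",
                                             "w", newline="")
                csv.writer(handles[group_id]).writerow(row)
    finally:
        for f in handles.values():
            f.close()

Like the awk answer, this holds one file handle per group, so with thousands of distinct group_ids you could hit the OS open-file limit; in that case, close and reopen files in append mode as needed.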
祖国的老花朵
#7 · 2019-01-09 08:11

Here's some food for thought for you:

import csv
from collections import namedtuple

# pair each open file with its csv writer so both can be looked up together
csvfile = namedtuple('csvfile', ('file', 'writer'))

class CSVFileCollections(object):

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix
        self.files = {}

    def __getitem__(self, item):
        if item not in self.files:
            # open lazily: one output file per distinct key, created on first use
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'",
                                quoting=csv.QUOTE_MINIMAL)
            self.files[item] = csvfile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for entry in self.files.values():
            entry.file.close()


with open('huge.csv', newline='') as readFile, \
        CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=",", quotechar="'")
    for row in reader:
        writer = output[row[0]]
        writer.writerow(row)