- I have a huge (250MB+) CSV file to upload
- The file format is
group_id, application_id, reading
and the data could look like
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)
- I want to divide the file based on group_id, so the output should be n files, one per distinct group_id
Output
File 1
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
and
File 2
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
and
File n
n, x, 0.3 (let's say)
How can I do this efficiently?
If the file is already sorted by group_id, you can do something like:
import csv
from itertools import groupby

with open("foo.csv", newline="") as source:
    for key, rows in groupby(csv.reader(source), lambda row: row[0]):
        with open("%s.txt" % key, "w") as output:
            for row in rows:
                output.write(",".join(row) + "\n")
awk is capable:
awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv
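One caveat worth noting (my addition, not from the original answer): awk keeps every output file open, so an input with very many distinct group_ids can exhaust the per-process file-descriptor limit. If the input happens to be sorted by group_id, a variant that closes each file when the group changes avoids this; a sketch, with a tiny sample file standing in for the real one:

```shell
# Sample input standing in for the real 250MB file (illustration only).
printf '1,a1,0.1\n1,a1,0.2\n2,b1,0.1\n2,b1,0.2\n' > HUGE.csv

# Close the previous group's output file whenever the first field changes,
# so only one output file is open at a time (assumes input sorted by group_id).
awk -F "," '$1 != prev { if (prev != "") close("FILE" prev); prev = $1 }
            { print $0 >> ("FILE" $1) }' HUGE.csv
```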
Sed one-liner:
sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile
The only downside is that you need to put in n -e expressions (represented by the ellipsis, which shouldn't appear in the final version), so this one-liner might be a pretty long line.
The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no Python is needed. Plus, it's a one-freaking-liner!
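If n is large, the sed script can be generated instead of typed by hand. A sketch of that idea (the 1..3 group-id range, the split.sed filename, and the availability of seq are my assumptions for illustration):

```shell
# Sample input standing in for the real file (illustration only).
printf '1,a1,0.1\n2,b1,0.1\n3,c1,0.3\n' > OriginalFile

# Generate one w-command per group id (here 1..3; adjust the range to n),
# then run the generated script in a single pass over the input.
seq 1 3 | sed 's|.*|/^&,/w File&|' > split.sed
sed -n -f split.sed OriginalFile
```

The -n flag suppresses sed's automatic printing, so the only output is the per-group files written by the w commands.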
If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use the csv module to parse the file (in case you didn't already know about it).
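A minimal sketch of that approach (the inline sample data and the group_&lt;id&gt;.csv output names are my own choices, not from the question):

```python
import csv
from itertools import groupby

# Sample input standing in for the real 250MB file (illustration only).
with open("huge.csv", "w", newline="") as f:
    f.write("1,a1,0.1\n1,a1,0.2\n2,b1,0.1\n2,b1,0.2\n")

# Stream the sorted file; groupby yields consecutive runs of rows that
# share a group_id, so only one row needs to be in memory at a time.
with open("huge.csv", newline="") as src:
    for group_id, rows in groupby(csv.reader(src), key=lambda row: row[0]):
        with open("group_%s.csv" % group_id, "w", newline="") as out:
            csv.writer(out).writerows(rows)
```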
If the rows are sorted by the group id, you can use the csv module to iterate over the rows in the file and write them out. You can find information about the module in the Python documentation.
Here's some food for thought for you:
import csv
from collections import namedtuple

csvfile = namedtuple('csvfile', ('file', 'writer'))

class CSVFileCollections(object):

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix
        self.files = {}

    def __getitem__(self, item):
        if item not in self.files:
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'", quoting=csv.QUOTE_MINIMAL)
            self.files[item] = csvfile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for f in self.files.values():
            f.file.close()

with open('huge.csv', newline='') as readFile, CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=",", quotechar="'")
    for row in reader:
        writer = output[row[0]]
        writer.writerow(row)