- I have a huge CSV file (250 MB+) to upload.
- The file format is
group_id, application_id, reading
and data could look like
    1, a1, 0.1
    1, a1, 0.2
    1, a1, 0.4
    1, a1, 0.3
    1, a1, 0.0
    1, a1, 0.9
    2, b1, 0.1
    2, b1, 0.2
    2, b1, 0.4
    2, b1, 0.3
    2, b1, 0.0
    2, b1, 0.9
    .....
    n, x, 0.3 (let's say)
- I want to split the file by group_id, so the output should be n files, where n is the number of distinct group_id values.
Output

File 1

    1, a1, 0.1
    1, a1, 0.2
    1, a1, 0.4
    1, a1, 0.3
    1, a1, 0.0
    1, a1, 0.9

and

File 2

    2, b1, 0.1
    2, b1, 0.2
    2, b1, 0.4
    2, b1, 0.3
    2, b1, 0.0
    2, b1, 0.9
    .....

and

File n

    n, x, 0.3 (let's say)
How can I do this effectively?
If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).

If the file is already sorted by group_id, you can do something like:

awk is capable:

If they are sorted by the group id, you can use the csv module to iterate over the rows in the file and write them out. You can find information about the module here.
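The sorted-input suggestion in the answers above (csv plus itertools.groupby) could look roughly like the sketch below. It is not any answerer's original code: the function name split_sorted_csv and the group_{}.csv output naming are made up for illustration, and it assumes the file has no header row and is already sorted by group_id.

```python
import csv
from itertools import groupby
from operator import itemgetter


def split_sorted_csv(src_path, out_pattern="group_{}.csv"):
    """Write one file per group_id (hypothetical helper).

    Assumes rows are already sorted by group_id and that there is no header row.
    """
    written = []
    with open(src_path, newline="") as src:
        # skipinitialspace drops the blank after each comma, so output rows
        # are written as "1,a1,0.1" rather than "1, a1, 0.1".
        reader = csv.reader(src, skipinitialspace=True)
        rows = (row for row in reader if row)  # ignore blank lines
        # groupby only merges *adjacent* rows with equal keys; on unsorted
        # input a group would be split and its file overwritten.
        for group_id, group_rows in groupby(rows, key=itemgetter(0)):
            out_path = out_pattern.format(group_id)
            with open(out_path, "w", newline="") as out:
                csv.writer(out).writerows(group_rows)
            written.append(out_path)
    return written
```

Only one input row and one output file are held open at a time, so memory use should stay flat regardless of the input size.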
How about: split() each line on a comma (",") to get the group_id
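A minimal sketch of that split() idea, which should also work when the input is not sorted. The function name split_by_group and the group_{}.csv naming are hypothetical. It keeps one output file open per group_id, so it assumes the number of distinct groups stays below the operating system's open-file limit.

```python
from contextlib import ExitStack


def split_by_group(src_path, out_pattern="group_{}.csv"):
    """Stream the input once, copying each line to the file for its group_id.

    Hypothetical helper; assumes the group_id field itself contains no comma.
    """
    outputs = {}  # group_id -> open output file
    # ExitStack closes every output file when the block exits.
    with ExitStack() as stack, open(src_path) as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            group_id = line.split(",", 1)[0].strip()
            if group_id not in outputs:
                outputs[group_id] = stack.enter_context(
                    open(out_pattern.format(group_id), "w")
                )
            outputs[group_id].write(line)  # line is copied unchanged
    return sorted(outputs)
```

If there are too many groups to keep every file open, one alternative is to open each target file in append mode per line, at the cost of more open/close calls.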
Here is some food for thought for you: