I have a 7GB csv
file which I'd like to split into smaller chunks, so it is readable and faster for analysis in Python on a notebook. I would like to grab a small set from it, maybe 250MB, so how can I do this?
问题:
回答1:
You don't need Python to split a csv file. Using your shell:
$ split -l 100 data.csv
Would split data.csv
in chunks of 100 lines.
回答2:
I had to do a similar task, and used the pandas package:
for i,chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
chunk.to_csv('chunk{}.csv'.format(i), index='False')
回答3:
Maybe something like this?
#!/usr/local/cpython-3.3/bin/python
import csv
divisor = 10
outfileno = 1
outfile = None
with open('big.csv', 'r') as infile:
for index, row in enumerate(csv.reader(infile)):
if index % divisor == 0:
if outfile is not None:
outfile.close()
outfilename = 'big-{}.csv'.format(outfileno)
outfile = open(outfilename, 'w')
outfileno += 1
writer = csv.writer(outfile)
writer.writerow(row)
回答4:
See the Python docs on file
objects (the object returned by open(filename)
- you can choose to read
a specified number of bytes, or use readline
to work through one line at a time.
回答5:
I agree with @jonrsharpe readline should be able to read one line at a time even for big files.
If you are dealing with big csv files might I suggest using pandas.read_csv. I often use it for the same purpose and always find it awesome (and fast). Takes a bit of time to get used to idea of DataFrames. But once you get over that it speeds up large operations like yours massively.
Hope it helps.