How can I split a large CSV file (7GB)?

Posted 2020-02-02 08:35

I have a 7GB CSV file that I'd like to split into smaller chunks, so that it's more manageable and faster to analyze in Python on a notebook. I'd like to grab a small subset of it, maybe 250MB. How can I do this?

Tags: python csv split
5 answers
家丑人穷心不美
#2 · 2020-02-02 08:40

I agree with @jonrsharpe: readline should be able to read one line at a time, even for big files.

If you are dealing with big CSV files, might I suggest using pandas.read_csv? I often use it for the same purpose and always find it awesome (and fast). It takes a bit of time to get used to the idea of DataFrames, but once you get over that, it speeds up large operations like yours massively.

Hope it helps.
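For example, a minimal sketch (the filenames and the row count are my assumptions, not from the question) that grabs just the first million rows without ever loading the whole 7GB:

import pandas as pd

# Assumed names: 'big.csv' is the input, 'sample.csv' the output.
# nrows stops the parser after that many rows, so memory use stays small.
sample = pd.read_csv('big.csv', nrows=1_000_000)
sample.to_csv('sample.csv', index=False)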

我只想做你的唯一
#3 · 2020-02-02 08:48

See the Python docs on file objects (the object returned by open(filename)): you can choose to read a specified number of bytes with read(size), or use readline to work through one line at a time.
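For instance, here is a rough sketch (the filenames are assumptions; the 250MB cutoff comes from the question) that copies whole lines until roughly 250MB has been written, so no row gets cut in half:

limit = 250 * 1024 * 1024  # ~250MB, per the question
bytes_written = 0
with open('big.csv') as infile, open('sample.csv', 'w') as outfile:
    outfile.write(infile.readline())   # keep the header row
    for line in infile:                # file objects iterate line by line
        outfile.write(line)
        bytes_written += len(line)     # approximate: counts characters, not bytes
        if bytes_written >= limit:
            break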

淡お忘
#4 · 2020-02-02 08:56

You don't need Python to split a CSV file. Using your shell:

$ split -l 100 data.csv

would split data.csv into chunks of 100 lines each. Note that split won't copy the CSV header row into every chunk; only the first chunk will contain it.

疯言疯语
#5 · 2020-02-02 09:02

Maybe something like this?

#!/usr/bin/env python3

import csv

divisor = 10  # rows per output file; raise this for a 7GB input

outfileno = 1
outfile = None
writer = None

# newline='' is what the csv module expects for its file handles
with open('big.csv', 'r', newline='') as infile:
    for index, row in enumerate(csv.reader(infile)):
        # Start a new output file every `divisor` rows.
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile)
        writer.writerow(row)

if outfile is not None:
    outfile.close()  # close the last chunk
做个烂人
#6 · 2020-02-02 09:04

I had to do a similar task and used the pandas package:

import pandas as pd

# Read the file 500,000 rows at a time and write each chunk out.
# Note: index must be the boolean False, not the string 'False',
# or the DataFrame index gets written as an extra column.
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)