Speed up writing to files

Posted 2019-01-11 10:35

Question:

I've used cProfile to profile some legacy code I've inherited. I've already made a bunch of changes that helped (like using simplejson's C extensions!).

Basically this script exports data from one system to an ASCII fixed-width file. Each row is a record with many values. Each line is 7158 characters long and contains a ton of spaces. There are 1.5 million records in total. Rows are generated one at a time, and generating them is slow (5-10 rows a second).

As each row is generated, it's written to disk as simply as possible. The profiling indicates that about 19-20% of the total time is spent in file.write(). For a test case of 1,500 rows, that's 20 seconds. I'd like to reduce that number.

It seems the next win will be reducing the time spent writing to disk. I can keep a cache of records in memory, but I can't wait until the end and dump it all at once.

import sys
from datetime import datetime

fd = open(data_file, 'w')
for c, (recordid, values) in enumerate(generatevalues()):
    row = prep_row(recordid, values)
    fd.write(row)
    if c % 117 == 0:
        if limit > 0 and c >= limit:
            break
        # Progress indicator: rewrite the same console line every 117 rows.
        sys.stdout.write('\r%s @ %s' % (str(c + 1).rjust(7), datetime.now()))
        sys.stdout.flush()
fd.close()

My first thought would be to keep a cache of records in a list and write them out in batches. Would that be faster? Something like:

rows = []
for c, (recordid, values) in enumerate(generatevalues()):
    rows.append(prep_row(recordid, values))
    if c % 117 == 0:
        # One write() call per batch instead of one per row
        # (assuming prep_row() already terminates each row).
        fd.write(''.join(rows))
        rows = []
# Don't lose the rows left over after the last full batch.
if rows:
    fd.write(''.join(rows))

My second thought would be to use another thread, but that makes me want to die inside.
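
For reference, a minimal sketch of that threaded approach, reusing the question's data_file, generatevalues(), and prep_row(), with a queue.Queue feeding a single writer thread:

import queue
import threading

row_queue = queue.Queue(maxsize=1000)

def writer(path):
    # Drain the queue to disk until the None sentinel arrives.
    with open(path, 'w') as fd:
        while True:
            row = row_queue.get()
            if row is None:
                break
            fd.write(row)

t = threading.Thread(target=writer, args=(data_file,))
t.start()

for recordid, values in generatevalues():
    row_queue.put(prep_row(recordid, values))

row_queue.put(None)   # tell the writer there are no more rows
t.join()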

Answer 1:

Actually, your problem is not that file.write() takes 20% of your time. It's that 80% of the time you aren't in file.write()!

Writing to disk is slow. It simply takes a long time to write data out to disk, and there is almost nothing you can do to speed that up.

What you want is for that I/O time to be the biggest part of the program, so that your speed is limited by the speed of the hard disk rather than by your processing time. The ideal is for file.write() to be at 100% usage!



Answer 2:

Batching the writes into groups of 500 did indeed speed up the writes significantly. For this test case, writing rows individually took 21.051 seconds of I/O, writing in batches of 117 took 5.685 seconds to write the same number of rows, and batches of 500 took only 0.266 seconds in total.
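
A minimal sketch of that batching pattern, reusing the question's data_file, generatevalues(), and prep_row(), and assuming prep_row() returns newline-terminated rows:

BATCH_SIZE = 500
rows = []
with open(data_file, 'w') as fd:
    for recordid, values in generatevalues():
        rows.append(prep_row(recordid, values))
        if len(rows) >= BATCH_SIZE:
            # One write() call per 500 rows instead of one per row.
            fd.write(''.join(rows))
            rows = []
    # Flush the final, partial batch.
    if rows:
        fd.write(''.join(rows))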



Answer 3:

You can use mmap in Python, which might help. But I suspect you made a mistake while profiling, because 7 KB * 1500 rows in 20 seconds is only about 0.5 MB/s. Do a test in which you write random lines of the same length, and you will see it's much faster than that.
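
A quick sanity check along those lines might look like this (a sketch; the file name bench.dat is made up, and the sizes mirror the question's numbers):

import os
import time

LINE_LEN = 7158   # characters per row, as in the question
N_ROWS = 1500     # rows in the question's test case

# Pre-generate one random line so the timed loop measures only the writes.
line = os.urandom(LINE_LEN // 2).hex()[:LINE_LEN - 1] + '\n'

start = time.time()
with open('bench.dat', 'w') as fd:
    for _ in range(N_ROWS):
        fd.write(line)
print('wrote %d rows in %.3f seconds' % (N_ROWS, time.time() - start))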