I've used cProfile to profile some legacy code I inherited. I've already made a number of changes that helped (like switching to simplejson's C extensions!).
Basically, this script exports data from one system to an ASCII fixed-width file. Each row is a record containing many values. Each line is 7,158 characters long and is mostly padding spaces, and there are 1.5 million records in total. Rows are generated one at a time, and generation is slow (5-10 rows per second).
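For a sense of the format, each row is built roughly like this (a hypothetical, simplified stand-in for the real prep_row; the field widths here are invented):

def prep_row(recordid, values):
    # Hypothetical sketch only: pad each value to a fixed column width,
    # then pad the whole line out to 7158 characters. Real widths differ.
    fields = [str(recordid).ljust(10)]
    fields += [str(v).ljust(40) for v in values]
    return ''.join(fields).ljust(7158) + '\n'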
As each row is generated, it's written to disk as simply as possible. Profiling indicates that about 19-20% of the total time is spent in file.write(). For a test case of 1,500 rows, that's 20 seconds, and I'd like to reduce that number.
So it seems the next win is reducing the amount of time spent writing to disk. I can keep a cache of records in memory, but I can't wait until the end and dump it all at once. Here's the current write loop:
import sys
from datetime import datetime

fd = open(data_file, 'w')
for c, (recordid, values) in enumerate(generatevalues()):
    row = prep_row(recordid, values)
    fd.write(row)
    if c % 117 == 0:
        if limit > 0 and c >= limit:
            break
        # In-place progress indicator: record count and timestamp.
        sys.stdout.write('\r%s @ %s' % (str(c + 1).rjust(7), datetime.now()))
        sys.stdout.flush()
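One thing I haven't tried yet is simply handing open() a bigger buffer, so most write() calls only copy into a userspace buffer instead of hitting the OS each time (the third argument is the buffer size in bytes; 1 MB is a guess):

fd = open(data_file, 'w', 1024 * 1024)  # 1 MB buffer; size is a guess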
My first thought would be to keep a cache of records in a list and write them out in batches. Would that be faster? Something like:
rows = []
for c, (recordid, values) in enumerate(generatevalues()):
    rows.append(prep_row(recordid, values))
    if c % 117 == 0:
        # Rows already end in '\n' (the loop above writes them verbatim),
        # so plain concatenation avoids doubled or missing separators.
        fd.write(''.join(rows))
        rows = []
if rows:
    fd.write(''.join(rows))  # don't drop the final partial batch
My second thought would be to use another thread, but that makes me want to die inside.
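If it came to that, I imagine it would look roughly like this (an untested sketch: a bounded queue.Queue feeding a dedicated writer thread, so row generation never blocks on disk):

import threading
from queue import Queue  # the module is named 'Queue' on Python 2

q = Queue(maxsize=1000)  # bounded, so memory use stays flat

def writer(fd):
    # Drain rows until the None sentinel arrives.
    for row in iter(q.get, None):
        fd.write(row)

fd = open(data_file, 'w')
t = threading.Thread(target=writer, args=(fd,))
t.start()
for c, (recordid, values) in enumerate(generatevalues()):
    q.put(prep_row(recordid, values))
q.put(None)  # tell the writer we're done
t.join()
fd.close()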