Writing array to Excel in Python with win32com

2020-02-29 00:41发布

问题:

I'm making a Python script which parses data files. The parsed data is then sent to an Excel file. The data can be rather huge. I'm looking at 10 to 20 columns, but the number of rows can be well over 100.000.

Writing this amount of data to Excel with win32com takes a considerable amount of time. My first step was to iterate over every cell in the Excel file, which is very time consuming. After some digging I found out how I can write a row with a single call, reducing the time needed drastically.

However, when I need to send 100.000 rows of data to Excel, it will still take a lot of time. I'm pretty sure that I can further speed this up, by sending the complete array with a single call. However I'm not able to achieve this so far.

See the code below, which demonstrates the issue. The code shows the differences in time. However, the third step, sending the complete array with a single call to a range is not resulting in the correct data in Excel. What am I doing wrong?

import win32com.client
import time

#create data array
row = range(0,10)
i = 0
data_array = []
while i < 1000:
    data_array.append(row)
    i += 1

#write the array to an excel file
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
excel.ScreenUpdating = False
book = excel.Workbooks.Add()
sh1 = book.Worksheets(1)
sh2 = book.Worksheets(2)
sh3 = book.Worksheets(3)

#double loop, writing individual cells
print "Writing with double loop to inidividual cells."
start = time.time()
row = 0
for line in data_array:
    row += 1
    col = 0
    for field in line:
        col += 1
        sh1.Cells(row, col).Value = field
print "Processing time: " + str(time.time() - start) + " seconds."

#single loop, writing a row to a range
print "Writing with double loop to inidividual cells."
start = time.time()
row = 0
for line in data_array:
    row += 1
    sh2.Range(sh2.Cells(row,1), sh2.Cells(row, len(line))).Value = line
print "Processing time: " + str(time.time() - start) + " seconds."

#no loop, write array to range
print "Writing with double loop to inidividual cells."
start = time.time()
try:
    sh3.Range(sh3.Cells(row,1), sh3.Cells(len(data_array), len(data_array[0]))).Value = data_array
    print "Processing time: " + str(time.time() - start) + " seconds."
except:
    print "Failed to write array to excel file."

excel.ScreenUpdating = True
sheet = None
book = None
excel.Quit()
excel = None

回答1:

I have been doing some research on this. And have been coming to some interesting conclusions.

There are indeed multiple solutions around to write data to excel from python. I ended up focussing on three modules.

win32com.client

Works slow. However the document can be opened in excel. So the end result is available for the user in excel to start working on. Not great for large amounts of data.

Writing 10,000 rows with 10 columns takes on my computer (core i5) app. 70 seconds.

import win32com.client
import time

#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
    data_array.append(row)
    i += 1

#write the array to an excel file
excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
excel.ScreenUpdating = False
book = excel.Workbooks.Add()
sheet = book.Worksheets(1)

#single loop, writing a row to a range
print "Writing using win32com.client"
start = time.time()
row = 0
for line in data_array:
    row += 1
    sheet.Range(sheet.Cells(row,1), sheet.Cells(row, len(line))).Value = line
print "Processing time: " + str(time.time() - start) + " seconds."

print "Completed: " + str(time.time() - start) + " seconds."
excel.ScreenUpdating = True
sheet = None
book = None
excel.Quit()
excel = None

openpyxl

Is a bit faster, but still not great. This module is slow on transfering the data to the excel memory object, but save it really fast. It creates the 10,000 lines with 10 columns in 22.3 seconds, and saves the file in another 0.5 seconds. When I test this with 100,000 rows and 10 columns. The data is created in 228.3 seconds, and saving the file is done in another 2.9 seconds. Fairly slow, but the file saving is fast. Hence openpyxl is probably good for making changes (formatting) to existing data, although I have not testing this yet. Another advantage is that coding with openpyxl is easier then using win32com.client.

import openpyxl
import sys
import time

#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
    data_array.append(row)
    i += 1

#create an excel workbook and sheet object
book = openpyxl.Workbook(optimized_write = True)

#single loop, writing rows
start = time.time()
print "Writing with single loop using openpyxl"
sheet = book.create_sheet()
for line in data_array:
    sheet.append(line)
print "Processing time: " + str(time.time() - start) + " seconds."

#close and save the file.
book.save('openpyxl.xlsx')
print "Total time: " + str(time.time() - start) + " seconds."

I am facing another issue with openpyxl. In my real tool, openpyxl is having problems with saving big amounts of data (> 10,000 rows). This I have not figured out yet, and maybe I won't bother to look further in to it.

PyExcelerate

This one is first of all fast. It creates the 10,000 rows and 10 columns in just 0.17 seconds. Saving the file however takes 2.2 seconds. By far the fastest option of the three. When I try to save 100,000 rows and 10 columns with this module, the excel data is created in just 1.8 seconds. However saving the file now takes 21.7 seconds. Hence, this module is really fast, but there is a penalty in writing the file. Overall still the fastest of them all. Another advantage of PyExcelerate is that coding is very easy, easier again as openpyxl.

import pyexcelerate
import sys
import time

#create data array
row = range(0,10)
i = 0
data_array = []
while i < 10000:
    data_array.append(row)
    i += 1

print sys.version

#create an excel workbook and sheet object
book = pyexcelerate.Workbook()

#single loop, writing rows
start = time.time()
print "Writing with single loop using PyExcelerate"
book.new_sheet("data", data = data_array)

print "Processing time: " + str(time.time() - start) + " seconds."

#close and save the file.
book.save('pyexcelerate.xlsx')
print "Total time: " + str(time.time() - start) + " seconds."

So my conclusion is that PyExcelerate is by far the fastest. The advantage of win32com.client is that the created excel file can be openend in excel to make the created data available for the user to start working with it. Openpyxl could be interesting to make style changes after creating it. However, I have not tested this yet. Hence, it could be beneficial to combine win32com.client, openpyxl and PyExcelerate in one application.



回答2:

Using the COM for reading files from a excel file is a extreme waste of time. It's like killing flyes with a tank. Take into account that the win32com does complicated calls with the windows API, that talk to excel, retrieves the data and sends it back to python. Why do that when the information is already there as a file?

There are libraries that parse directly the excel file, and as you can imagine they can be x100 times faster, since there is no over-complex calls to the win API.

I've worked a lot successfully with openpyxl, but there are others libraries out there that can be as good or even better.

Just an example for huge data (uses generators instead of loading everything into memory):

from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name(name='big_data') # ws is now an IterableWorksheet

for row in ws.iter_rows(): # it brings a new method: iter_rows()
     for cell in row:
         print cell.internal_value

Equivalent methods are available to write into cells. You can even format them, although it is not (or used to be) very complete.

EDIT

Example on how to write a big amount of info to a xlsx file:

from openpyxl import Workbook
from openpyxl.cell import get_column_letter

wb = Workbook()
dest_filename = r'empty_book.xlsx'
ws = wb.active
ws.title = "range names"

for col_idx in xrange(1, 40):
    col = get_column_letter(col_idx)
    for row in xrange(1, 600):
        ws.cell('%s%s'%(col, row)).value = '%s%s' % (col, row)

ws = wb.create_sheet()
ws.title = 'Pi'
ws['F5'] = 3.14
wb.save(filename=dest_filename)


回答3:

I know the post is a bit old.
However, being stuck with win32com for some other reason (macro executions) and needing a similar solution to your 3rd sheet way (complete array at once), I tried the code of your initial post and found a small mistake that prevented it to work.
So to answer your original question "What am I doing wrong?":
You forgot to re-initialize the row variable to 1 before using it in the '#no loop, write array to range' part.
Here is the win32com part of your comparison post updated then:

print "Writing using win32com.client"
start = time.time()
row = 1
sheet.Range(sheet.Cells(row,1), sheet.Cells(row+len(data_array)-1, len(data_array[0]))).Value = data_array
print "Processing time: " + str(time.time() - start) + " seconds."

book.SaveAs(Filename="C:\Temp\Temp.xls", FileFormat=56)
print "Completed: " + str(time.time() - start) + " seconds."

Then, timing is no so bad:
Writing using win32com.client
Processing time: 0.322000026703 seconds.
Completed: 1.73699998856 seconds.
Pretty fast then. Maybe your comparison can be updated since the PyExcelerate numbers are not so different anymore (and my computer is slower).