Is openpyxl the fastest package to use to read big xlsx files?

Published 2019-04-16 04:57

Question:

I'm a Python beginner, but I'm writing a script that uses openpyxl to read big xlsx files (60000×187) into NumPy arrays for some machine learning. My code:

from openpyxl import load_workbook
import re
from numpy import *

wb = load_workbook(filename = 'dataSheet.xlsx', use_iterators = True) #dataSheet.xlsx
ws1 = wb.get_sheet_by_name(name = 'LogFileData')

startCol = 1 #index from 1
startRow = 2 #start at least from 2 because header is in 1st row
endCol   = ws1.get_highest_column() #index of last used column, from 1
endRow   = ws1.get_highest_row() #index of last used row, indexed from 1
diff     = endRow - startRow + 1 #number of rows in the data array

header   = [] #contains the column labels
data     = zeros((0,endCol), dtype=float64) #2D array that holds the data

#puts the column headers into a list
for row in ws1.get_squared_range(1, 1, endCol, 1): #indexed from 1
    for cell in row:
        for match in re.findall("<(.*?)>", cell.value):
            header.append(match)

#indexed from 1 when using the ws1
#index from 0 when using the Numpy arrays, tempRow, tempPt, data
for index, row in enumerate(ws1.iter_rows(row_offset=1)):
    tempRow = zeros((1,0), dtype=float64)
    tempPt = zeros((1,1), dtype=float64)
    for cell in row:
        value = cell.value
        if isinstance(value, basestring):
            tempPt[0][0] = None
        else:
            tempPt[0][0]=value
        tempRow = hstack((tempRow,tempPt))
    data = vstack((data,tempRow))

Are openpyxl and the optimized reader the fastest and most space-efficient way to do this? A colleague mentioned that a csv file might be faster when read with itertools or a similar package.
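For reference, this is roughly what I imagine the CSV route would look like (a minimal, untested sketch; dataSheet.csv is a hypothetical export of the same sheet, and I kept the same header and NaN handling as above):

import csv
import re

import numpy as np

def to_float(v):
    """Mirror the xlsx logic: numbers stay, text/empty cells become NaN."""
    try:
        return float(v)
    except ValueError:
        return np.nan

with open('dataSheet.csv', 'rb') as f:      # 'rb' for the csv module on Python 2
    reader = csv.reader(f)
    header = [m for cell in next(reader)    # header is in the first row
                for m in re.findall("<(.*?)>", cell)]
    data = np.array([[to_float(v) for v in row] for row in reader],
                    dtype=np.float64)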

Edit 1: My specs: Ubuntu 10.04 LTS on VMware (Windows 7 Enterprise host), Python 2.6.5, Intel i5 quad-core 2.5 GHz.

Answer 1:

I benchmarked the optimised reader on my 2009 MacBook at about 20 s for 1 million cells with numbers. I would expect a slight hit with your code because of the indirection for cells and the pattern matching (compile the pattern outside the loop), but I would still expect the speed to be acceptable. CSV will, of course, be faster if you can get it easily.

I'd be interested to know your numbers.
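For illustration, here is an untested sketch of the change I mean, applied to your loop (it reuses ws1, endCol and diff from your snippet; as a further tweak it also preallocates the array instead of growing it with vstack/hstack, which copy the whole array on every call):

import re
from numpy import zeros, float64, nan

pattern = re.compile("<(.*?)>")              # compiled once, outside the loop

header = []
for row in ws1.get_squared_range(1, 1, endCol, 1):
    for cell in row:
        header.extend(pattern.findall(cell.value))

data = zeros((diff, endCol), dtype=float64)  # preallocated, no vstack/hstack
for i, row in enumerate(ws1.iter_rows(row_offset=1)):
    for j, cell in enumerate(row):
        value = cell.value
        if isinstance(value, basestring) or value is None:
            data[i, j] = nan                 # text and empty cells become NaN
        else:
            data[i, j] = value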



Answer 2:

The fastest way to read an xlsx sheet.

A 56 MB file with over 500k rows and 4 sheets took about 6 s to process.

import zipfile
from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
archive = zipfile.ZipFile(filename, 'r')  # avoid shadowing the builtin 'file'

# Map each sheet name to its worksheet XML member inside the archive
for name in archive.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(archive.read(name), 'html.parser')
        for sheet in data.find_all('sheet'):
            paths.append([sheet.get('name'),
                          'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

for path in paths:
    if path[0] == mySheet:
        with archive.open(path[1]) as reader:  # closed automatically by 'with'
            for line in reader:  # yields raw XML lines, not parsed rows
                print(line)      # do whatever you want with your data
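And if you need actual cell values rather than raw XML lines, the same worksheet member can be stream-parsed, for example with xml.etree.ElementTree.iterparse. A minimal sketch with the same caveats (the sheet path is assumed, and cells stored as shared strings come back as indexes into xl/sharedStrings.xml rather than as text):

import zipfile
import xml.etree.ElementTree as ET

NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

with zipfile.ZipFile('xlfile.xlsx', 'r') as archive:
    with archive.open('xl/worksheets/sheet1.xml') as sheet_xml:
        for event, elem in ET.iterparse(sheet_xml):
            if elem.tag == NS + 'row':
                # <c> is a cell, <v> its value; shared-string cells (t="s")
                # hold an index into xl/sharedStrings.xml, not the text itself
                values = [c.findtext(NS + 'v') for c in elem.findall(NS + 'c')]
                print(values)
                elem.clear()  # drop the parsed row to keep memory flat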

Enjoy and happy coding.