Python pandas cannot read old excel files with som

2019-07-21 01:33发布

问题:

I am trying to import an old excel file into pandas. The file is generated by the cropSyst software using an old format (which I don't know) and it has split panes by default. This is a snapshot

A solution could be to open the file with excel and save it in xlsx: this way pandas can read it with no problem, but I have to deal with a large number of files (> 40000) so it's not feasible.

when trying to import it with pandas

pd.read_excel(filename)

I get this error

error: unpack requires a string argument of length 3
*** No CODEPAGE record, no encoding_override: will use 'ascii'

I tried also to import it as csv

pd.read_csv(filename)

but it reads only some data, I tried using different combinations of line separators and cell delimiters with no luck.

do you have some suggestions?

EDIT

As suggested by @jmcnamara in the comments I tried to override the encoding of the file using all the encodings available in Python, but still no luck: it does not parse correctly the file.

Here is possible to download one of the xls files

https://drive.google.com/file/d/0B5smcFuBd6EtdFE5eEktZXZVTkU/view?usp=sharing

回答1:

The format of the file is Excel 2 (BIFF2).

However, since it wasn't created by Excel it seems to contain inconsistencies with the Excel BIFF2 spec.

For the file you show the WINDOW2 record is incorrect. You can workaround this by getting the current version of xlrd (0.9.3) and applying the following patch:

diff --git a/xlrd/sheet.py b/xlrd/sheet.py
index 36438a0..6d895c4 100644
--- a/xlrd/sheet.py
+++ b/xlrd/sheet.py
@@ -1455,7 +1455,8 @@ class Sheet(BaseObject):
                     (self.first_visible_rowx, self.first_visible_colx,
                     self.automatic_grid_line_colour,
                     ) = unpack("<HHB", data[5:10])
-                    self.gridline_colour_rgb = unpack("<BBB", data[10:13])
+                    if data_len > 10:
+                        self.gridline_colour_rgb = unpack("<BBB", data[10:13])
                     self.gridline_colour_index = nearest_colour_index(
                         self.book.colour_map, self.gridline_colour_rgb, debug=0)
                     self.cached_page_break_preview_mag_factor = 0 # default (60%)

Then install this version of the module or use it from your PYTHONPATH since pandas uses xlrd to read Excel files.

This still gives the Codepage warning but that is only a warning and you can use encoding_override='ascii' (or whatever the correct encoding is but ascii is probably right).

Note, there may other issues in the file format, given that you have 40,000 files, but that was the only one I found in the file you provided.

Update: Based on the second example file it looks like the encoding is Windows CP-1252 so the following should work:

import xlrd 

workbook = xlrd.open_workbook('harvest.xls', encoding_override='cp1252') 


回答2:

I've previously used this with success to open old Excel files, give it a try:

http://www.python-excel.org/