Python zipfile module - zipfile.write() file with

Posted 2019-07-14 05:52

Question:

On my system there are many Word documents and I want to zip them using the Python module zipfile.

I have found this solution to my problem, but on my system there are files which contain German umlauts and Turkish characters in their filename.

I have adapted the method from the solution like this, so it can process German umlauts in the filenames:

def zipdir(path, ziph):
    for root, dirs, files in os.walk(path):
        for file in files:
            current_file = os.path.join(root, file)
            print "Adding to archive -> file: "+str(current_file)
            try:
                #ziph.write(current_file.decode("cp1250")) #German umlauts ok, Turkish chars not ok
                ziph.write(current_file.encode("utf-8")) #both not ok
                #ziph.write(current_file.decode("utf-8")) #both not ok
            except Exception,ex:
                print "exception ---> "+str(ex)
                print repr(current_file)
                raise

Unfortunately, my attempts to handle Turkish characters were unsuccessful: every time a filename contains a Turkish character, the code prints an exception like this:

exception ---> [Error 123] Die Syntax f³r den Dateinamen, Verzeichnisnamen oder
die Datentrõgerbezeichnung ist falsch: u'X:\\my\\path\\SomeTurk?shChar?shere.doc'

I have tried several encode/decode combinations, but none of them was successful.

Can someone help me out here?


I edited the above code to include the changes mentioned in the comment.

The following errors are now shown:

...
Adding to archive -> file: X:\\my\path\blabla I blabla.doc
Adding to archive -> file: X:\my\path\bla bla³bla³bla³bla.doc
exception ---> 'ascii' codec can't decode byte 0xfc in position 24: ordinal not
in range(128)
'X:\\my\\path\\bla B\xfcbla\xfcbla\xfcbla.doc'
Traceback (most recent call last):
  File "Backup.py", line 48, in <module>
    zipdir('X:\\my\\path', zipf)
  File "Backup.py", line 12, in zipdir
    ziph.write(current_file.encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 24: ordinal
 not in range(128)

The ³ is actually a German ü.
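The garbling can be reproduced outside the console. Here is a minimal Python 3 sketch, assuming the filename bytes are in Windows-1252 and the German console displays them with code page 850. Both code pages are assumptions; the output above does not say which ones are active.

```python
# Python 3 sketch: one byte, interpreted under two different code pages.
raw = b"\xfc"  # the byte reported in the UnicodeDecodeError above

as_windows = raw.decode("cp1252")  # assumed ANSI code page of the filename
as_console = raw.decode("cp850")   # assumed OEM code page of the console

# Same byte, two different characters depending on the code page.
print(repr(as_windows), repr(as_console))
```

So the filename itself may be fine; only the console rendering of the byte differs.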


EDIT

After trying the suggested things in the comments, I could not work out a solution.

Therefore I switched to the Groovy programming language and used its ZIP capabilities.

As this is an opinion-based discussion, I have decided to vote for closing the thread.

回答1:

If you do not need to inspect the ZIP file with any archiver later, you can always encode the filenames to base64 when archiving, and then restore them when extracting with Python.

To any archiver these filenames will look like gibberish but encoding will be preserved.
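A minimal sketch of that idea, written for Python 3. The helper names to_archive_name and from_archive_name are hypothetical, and I use the URL-safe base64 alphabet as my own choice, because the standard alphabet contains "/", which ZIP treats as a path separator.

```python
import base64

def to_archive_name(name):
    """Turn a unicode filename into an ASCII-only archive member name."""
    encoded = base64.urlsafe_b64encode(name.encode("utf-8"))
    return encoded.decode("ascii")

def from_archive_name(member):
    """Recover the original unicode filename from the member name."""
    decoded = base64.urlsafe_b64decode(member.encode("ascii"))
    return decoded.decode("utf-8")

original = "T\u00fcrk\u00e7e \u015fark\u0131.doc"  # Türkçe şarkı.doc
member = to_archive_name(original)
restored = from_archive_name(member)
```

Note that the file extension is hidden as well, so the archive is only convenient to unpack with a script that reverses the encoding.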

Anyway, to get a byte string (str in Python 2, a bytes object in Python 3) out of a unicode string, you have to encode(), not decode().

encode() serializes the unicode() string to bytes in the given encoding.

>>> u"\u0161blah".encode("utf-8")
'\xc5\xa1blah'

decode() converts those bytes back to unicode():

>>> "\xc5\xa1blah".decode("utf-8")
u'\u0161blah'

Same goes for any other codepage.
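To illustrate with other code pages, here is a small Python 3 sketch (in Python 3, str is already unicode, so no u prefix is needed). It shows why cp1250 works for German umlauts but not for all Turkish letters, which matches the cp1250 observation in the question. The sample characters are my own choice.

```python
german = "\u00fc"   # ü
turkish = "\u011f"  # ğ

german_cp1250 = german.encode("cp1250")    # ü exists in cp1250
turkish_cp1254 = turkish.encode("cp1254")  # ğ exists in the Turkish code page

try:
    turkish.encode("cp1250")
    turkish_fits_cp1250 = True
except UnicodeEncodeError:
    # cp1250 has no slot for ğ, so encoding fails
    turkish_fits_cp1250 = False

# UTF-8 can represent both characters in the same string
both_utf8 = (german + turkish).encode("utf-8")
```

No single legacy code page here covers both alphabets, which is why UTF-8 is the usual target.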

Sorry for emphasizing that, but people sometimes get confused about encoding and decoding stuff.

If you need the files but aren't much concerned about preserving umlauts and other symbols, you can encode to a limited codec such as ASCII with an error handler:

u"üsdlakui".encode("ascii", "replace")

or:

u"üsdlakui".encode("ascii", "ignore")

"replace" substitutes "?" for every character the codec cannot represent, and "ignore" drops those characters silently. (With "utf-8" as the target these handlers change nothing, because UTF-8 can represent every Unicode character.)

That avoids errors like UnicodeEncodeError: 'ascii' codec can't encode character ...

However, filenames consisting only of non-Latin characters end up as a run of "?" or as an empty string.
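A quick Python 3 sketch of what the two error handlers do. I use "ascii" as the target codec here, since the handlers only matter when the codec cannot represent a character. The second sample name is made up to show the all-non-Latin case.

```python
name = "\u00fcsdlakui"  # üsdlakui

replaced = name.encode("ascii", "replace")  # unencodable chars become "?"
ignored = name.encode("ascii", "ignore")    # unencodable chars are dropped

# A name made only of characters outside ASCII loses everything.
turkish_only = "\u011f\u0131\u015f"  # ğış
all_replaced = turkish_only.encode("ascii", "replace")
all_ignored = turkish_only.encode("ascii", "ignore")
```

Two different files can collapse to the same name this way, so it is only safe when collisions are checked for.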

Now something that might actually work:

Well,

'Sömethüng'.encode("utf-8")

is bound to raise a UnicodeDecodeError ("'ascii' codec can't decode byte ...") in Python 2. The literal is a byte string, not a unicode string, so encode() first has to decode it implicitly with the default ASCII codec, and that fails on the non-ASCII bytes.

while:

# -*- coding: UTF-8 -*-
u'Sömethüng'.encode("utf-8")

or

# -*- coding: UTF-8 -*-
unicode('Sömethüng', "utf-8").encode("utf-8")

with the encoding declared at the top of the file, and the file itself saved as UTF-8, should work.

Yes, in your case the strings come from the OS (filenames) rather than from literals, but that was the problem from the beginning of the story.

Even once the encoding is right, the ZIP side still has to be solved.

By specification ZIP should store filenames using CP437, but this is rarely so.

Most archivers use the default OS encoding (MBCS in Python).

And most archivers don't support UTF-8. So what I propose here should work, but not with all archivers.
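To see what such an archiver would display, here is a tiny Python 3 sketch. It assumes a tool that interprets the stored name bytes as CP437, as the specification prescribes when no UTF-8 flag is honoured. The sample name is made up.

```python
# UTF-8 bytes of a name containing ü, as they would be stored in the archive
utf8_bytes = "\u00fc.doc".encode("utf-8")

# What a tool sees if it interprets those bytes as CP437 instead
shown_by_legacy_tool = utf8_bytes.decode("cp437")

print(repr(shown_by_legacy_tool))
```

The two UTF-8 bytes of ü turn into two box-drawing characters, which is the typical gibberish seen in such tools.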

To tell an archiver that the archive uses UTF-8 filenames, bit 11 (mask 0x800) of flag_bits must be set. As I said, some archivers do not check that bit. It is a fairly recent addition to the ZIP spec (well, a few years old really).
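The flag can be inspected from Python. Below is a minimal Python 3 sketch; as far as I know, Python 3's zipfile sets this bit by itself whenever a member name is not pure ASCII, so nothing is set manually here. The member names and the UTF8_FLAG constant are made up for the demonstration.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag

# Build a small archive in memory with one ASCII and one non-ASCII name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.doc", b"ascii name")
    archive.writestr("B\u00fcro \u011f\u0131\u015f.doc", b"non-ascii name")

# Read it back and record which entries carry the UTF-8 flag.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in archive.infolist()
    }
```

Only the entry with the non-ASCII name should carry the flag; the ASCII one is stored without it.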

I won't write the whole code here, just the part needed to understand the idea.

# -*- coding: utf-8 -*-
# Cannot hurt to have default encoding set to UTF-8 all the time. :D

import os, time, zipfile
zip = zipfile.ZipFile(...)
# Careful here, origname is the full path to the file you will store into ZIP
# filename is the filename under which the file will be stored in the ZIP
# It'll probably be better if filename is not a full path, but relative, not to introduce problems when extracting. You decide.
filename = origname = os.path.join(root, filename)
# Filenames from OS can be already UTF-8, but they can be a local codepage.
# I will use MBCS here to decode from it, so that we can encode to UTF-8 later.
# I recommend getting codepage from OS (from kernel32.dll on Windows) manually instead of using MBCS, but for now:
if isinstance(filename, str): filename = filename.decode("mbcs")
# Else, assume it is already a decoded unicode string.
# Prepare the filename for archive:
filename = os.path.normpath(os.path.splitdrive(filename)[1])
while filename[0] in (os.sep, os.altsep):
    filename = filename[1:]
filename = filename.replace(os.sep, "/")
filename = filename.encode("utf-8") # Get what we need
zinfo = zipfile.ZipInfo(filename, time.localtime(os.path.getmtime(origname))[0:6])
# Here you should set zinfo.external_attr to store Unix permission bits and set zinfo.compress_type.
# Both are optional and unrelated to your problem, but worth noting.
zinfo.flag_bits |= 0x800 # Set bit 11 (0x800) to announce UTF-8 filenames.
f = open(origname, "rb")
zip.writestr(zinfo, f.read())
f.close()

I didn't test this, I just wrote it down, but it shows the idea, even if a bug has crept in somewhere.

If this doesn't work, I don't know what will.
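For comparison, here is roughly how the same directory walk could look in Python 3, where os.walk returns unicode names and zipfile encodes non-ASCII member names as UTF-8 on its own (as far as I know). This is a sketch under those assumptions, not a tested replacement for the Python 2 code above. The function name zip_directory is hypothetical.

```python
import os
import zipfile

def zip_directory(src_dir, zip_path):
    """Add every file under src_dir to zip_path, with names relative to src_dir.

    zip_path should live outside src_dir, otherwise the archive would try
    to include itself.
    """
    with zipfile.ZipFile(zip_path, "w") as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store a relative, forward-slash path instead of the full path.
                arcname = os.path.relpath(full_path, src_dir).replace(os.sep, "/")
                archive.write(full_path, arcname)
```

Usage would be something like zip_directory("X:\\my\\path", "backup.zip"). Relative member names avoid surprises with drive letters and absolute paths when extracting.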