Python 3: gzip.open() and modes

Posted 2019-02-24 00:21

Question:

https://docs.python.org/3/library/gzip.html

I am considering using gzip.open(), and I am a little confused about the mode argument:

The mode argument can be any of 'r', 'rb', 'a', 'ab', 'w', 'wb', 'x' or 'xb' for binary mode, or 'rt', 'at', 'wt', or 'xt' for text mode. The default is 'rb'.

So what is the difference between 'w' and 'wb'?

The documentation states that they are both binary modes.

So does that mean that there is no difference between 'w' and 'wb'?

Answer 1:

It means that 'r' defaults to 'rb', and if you want text you have to ask for it explicitly with 'rt'.

(This is the opposite of the built-in open(), where 'r' means 'rt', not 'rb'.)
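A minimal sketch of that difference, assuming a throwaway file in a temporary directory (the path and the variable names are made up for the illustration):

```python
import gzip
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")

# Write a short string in text mode so there is something to read back.
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("hello")

# 'r' is binary for gzip.open(), so read() returns bytes.
with gzip.open(path, "r") as f:
    binary_data = f.read()

# 'rt' must be requested explicitly to get str back.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text_data = f.read()

print(type(binary_data).__name__, type(text_data).__name__)
```

With the built-in open(), the same 'r' mode would have returned str instead.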



Answer 2:

Exactly as you say, and as already covered by @Jean-François Fabre's answer.
I just wanted to show some code, as it was fun.
Let's look at the gzip.py source code in the Python standard library to see that this is effectively what happens.
gzip.open() can be found at https://github.com/python/cpython/blob/master/Lib/gzip.py and is reproduced below:

def open(filename, mode="rb", compresslevel=9,
         encoding=None, errors=None, newline=None):
    """Open a gzip-compressed file in binary or text mode.
    The filename argument can be an actual filename (a str or bytes object), or
    an existing file object to read from or write to.
    The mode argument can be "r", "rb", "w", "wb", "x", "xb", "a" or "ab" for
    binary mode, or "rt", "wt", "xt" or "at" for text mode. The default mode is
    "rb", and the default compresslevel is 9.
    For binary mode, this function is equivalent to the GzipFile constructor:
    GzipFile(filename, mode, compresslevel). In this case, the encoding, errors
    and newline arguments must not be provided.
    For text mode, a GzipFile object is created, and wrapped in an
    io.TextIOWrapper instance with the specified encoding, error handling
    behavior, and line ending(s).
    """
    if "t" in mode:
        if "b" in mode:
            raise ValueError("Invalid mode: %r" % (mode,))
    else:
        if encoding is not None:
            raise ValueError("Argument 'encoding' not supported in binary mode")
        if errors is not None:
            raise ValueError("Argument 'errors' not supported in binary mode")
        if newline is not None:
            raise ValueError("Argument 'newline' not supported in binary mode")

    gz_mode = mode.replace("t", "")
    if isinstance(filename, (str, bytes, os.PathLike)):
        binary_file = GzipFile(filename, gz_mode, compresslevel)
    elif hasattr(filename, "read") or hasattr(filename, "write"):
        binary_file = GzipFile(None, gz_mode, compresslevel, filename)
    else:
        raise TypeError("filename must be a str or bytes object, or a file")

    if "t" in mode:
        return io.TextIOWrapper(binary_file, encoding, errors, newline)
    else:
        return binary_file  
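The argument checks at the top of this function can be exercised directly. The snippet below is only a sketch: the in-memory buffer and the variable names are assumptions made for the illustration, and the exact error text may differ between Python versions.

```python
import gzip
import io

errors_seen = []

# Each call below is expected to be rejected before any GzipFile is created:
# first 't' and 'b' together, then an encoding passed in binary mode.
for kwargs in ({"mode": "wtb"}, {"mode": "wb", "encoding": "utf-8"}):
    try:
        gzip.open(io.BytesIO(), **kwargs)
    except ValueError as exc:
        errors_seen.append(str(exc))

for message in errors_seen:
    print(message)

# In text mode the GzipFile is wrapped in an io.TextIOWrapper.
with gzip.open(io.BytesIO(), "wt", encoding="utf-8") as text_file:
    wrapper_type = type(text_file).__name__

print(wrapper_type)
```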

A few things to notice:

  • the default mode is "rb", as the documentation you quote says
  • when opening in binary mode, it does not matter whether the mode is "r" or "rb", or "w" or "wb", for example.
    We can see this in the following lines:

    gz_mode = mode.replace("t", "")
    if isinstance(filename, (str, bytes, os.PathLike)):
        binary_file = GzipFile(filename, gz_mode, compresslevel)
    elif hasattr(filename, "read") or hasattr(filename, "write"):
        binary_file = GzipFile(None, gz_mode, compresslevel, filename)
    else:
        raise TypeError("filename must be a str or bytes object, or a file")
    
    if "t" in mode:
        return io.TextIOWrapper(binary_file, encoding, errors, newline)
    else:
        return binary_file
    

    Basically, binary_file gets built whether or not there is an additional b: only "t" is stripped, so gz_mode may or may not contain "b" at this point.
    The class GzipFile(_compression.BaseStream) is then instantiated to build binary_file.

In the constructor the following lines are important:

    if mode and ('t' in mode or 'U' in mode):
        raise ValueError("Invalid mode: {!r}".format(mode))
    if mode and 'b' not in mode:
        mode += 'b'
    if fileobj is None:
        fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
    if filename is None:
        filename = getattr(fileobj, 'name', '')
        if not isinstance(filename, (str, bytes)):
            filename = ''
    else:
        filename = os.fspath(filename)
    if mode is None:
        mode = getattr(fileobj, 'mode', 'rb')

Here it can clearly be seen that if 'b' is not present in the mode, it is added:

    if mode and 'b' not in mode:
        mode += 'b'

So there is no distinction between 'w' and 'wb', as already discussed.
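As a quick check of that conclusion, here is a small sketch that opens one file with 'w' and another with 'wb' and compares the objects returned. The file names, the payload and the dictionaries are assumptions made up for the example.

```python
import gzip
import os
import tempfile

payload = b"same bytes either way"
tmpdir = tempfile.mkdtemp()  # hypothetical scratch directory

results = {}
round_trips = {}
for mode in ("w", "wb"):
    path = os.path.join(tmpdir, "demo_" + mode + ".gz")
    with gzip.open(path, mode) as f:
        # Record the class and the mode attribute that GzipFile settled on.
        results[mode] = (type(f).__name__, f.mode)
        f.write(payload)  # both modes accept bytes
    with gzip.open(path) as f:  # default mode is "rb"
        round_trips[mode] = f.read()

print(results["w"] == results["wb"])
print(round_trips["w"] == round_trips["wb"] == payload)
```

Both comparisons should come out true: each call returns a GzipFile opened for binary writing, and the data round-trips identically.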