I don't want to use OS commands, as that would make the solution OS dependent.
This is available in tarfile: tarfile.is_tarfile(filename) checks whether a file is a tar file or not.
I am not able to find any equivalent function in the gzip module.
EDIT:
Why do I need this: I have a list of gzip files; they vary in size (1-10 GB) and some are empty. Before reading a file (using pandas.read_csv), I want to check whether the file is empty, because for empty files I get an error in pandas.read_csv. (Error like: Expected 15 columns and found -1)
Sample command with error:
import pandas as pd
pd.read_csv('C:\Users\...\File.txt.gz', compression='gzip', names={'a', 'b', 'c'}, header=False)
Too many columns specified: expected 3 and found -1
pandas version is 0.16.2
The file used for testing is just a gzip of an empty file.
Unfortunately, the gzip module does not expose any functionality equivalent to the -l list option of the gzip program. But in Python 3 you can easily get the size of the uncompressed data by calling the .seek method with a whence argument of 2, which signifies positioning relative to the end of the (uncompressed) data stream. .seek returns the new byte position, so .seek(0, 2) returns the byte offset of the end of the uncompressed file, i.e., the file size. Thus if the uncompressed file is empty the .seek call will return 0.
Here's a function that will work on Python 2, tested on Python 2.6.6.
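As a rough sketch of the two approaches described above (the function names and the chunk size are illustrative assumptions, not part of the gzip module): the first variant uses the seek-from-end call and needs a Python 3 release whose GzipFile supports whence=2; the second simply reads the stream in chunks and counts bytes, avoiding both whence=2 and the with statement so that it does not depend on newer features.

```python
import gzip
import io


def gz_size_by_seek(path):
    """Uncompressed size via seek-from-end (assumes a recent Python 3)."""
    with gzip.open(path, 'rb') as f:
        # Seeking relative to the end forces gzip to decompress the whole
        # stream; the return value is the resulting byte offset.
        return f.seek(0, io.SEEK_END)


def gz_size_by_read(path, chunk_size=1 << 20):
    """Uncompressed size by reading and discarding fixed-size chunks."""
    total = 0
    f = gzip.open(path, 'rb')
    try:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of stream
                break
            total += len(chunk)
    finally:
        f.close()
    return total
```

Both variants decompress the entire file, so on multi-gigabyte inputs they are slow; if you only need to know whether the data is empty, reading a single byte is enough.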
You can read about .seek and other methods of the GzipFile class using the pydoc program. Just run pydoc gzip in the shell.
Alternatively, if you wish to avoid decompressing the file you can (sort of) read the uncompressed data size directly from the .gz file. The size is stored in the last 4 bytes of the file as a little-endian unsigned long, so it's actually the size modulo 2**32; therefore it will not be the true size if the uncompressed data size is >= 4GB.
This code works on both Python 2 and Python 3.
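A minimal sketch of reading that trailing size field, assuming the file really is a single complete gzip stream (the helper name gz_trailer_size is hypothetical):

```python
import os
import struct


def gz_trailer_size(path):
    """Return the size field stored in the last 4 bytes of a .gz file.

    The value is the uncompressed size modulo 2**32, and it is only
    meaningful if the file is one complete gzip stream.
    """
    if os.path.getsize(path) < 4:
        raise ValueError("file is too short to contain a gzip trailer")
    with open(path, 'rb') as f:
        f.seek(-4, os.SEEK_END)  # position at the last 4 bytes
        # '<L' = little-endian unsigned 32-bit integer
        return struct.unpack('<L', f.read(4))[0]
```

A result of 0 therefore means either empty data or a size that is an exact multiple of 2**32.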
However, this method is not reliable, as Mark Adler (gzip co-author) mentions in the comments:
Here is another solution. It does not decompress the whole file. It returns True if the uncompressed data in the input file is of zero length, but it also returns True if the input file itself is of zero length. If the input file is not of zero length and is not a gzip file then OSError is raised.
If you want to check whether a file is a valid Gzip file, you can open it and read one byte from it. If it succeeds, the file is quite probably a gzip file, with one caveat: an empty file also succeeds this test.
Thus we get
However, as I stated earlier, a file which is empty (0 bytes), still succeeds this test, so you'd perhaps want to ensure that the file is not empty:
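One way this combined check might be sketched, assuming Python 3 (where a non-gzip file raises OSError on the first read); the function name is_nonempty_gz_file is illustrative:

```python
import gzip
import os
import zlib


def is_nonempty_gz_file(path):
    """True if `path` is not a 0-byte file and its first byte decompresses."""
    # A 0-byte file would otherwise pass the read test below.
    if os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, 'rb') as f:
            f.read(1)  # forces the header (and a little data) to be parsed
    except (OSError, EOFError, zlib.error):
        # Not a gzip file, truncated, or corrupt compressed data.
        return False
    return True
```

Note that this still returns True for a valid gzip file whose uncompressed contents are empty; it only rules out a 0-byte file on disk.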
EDIT:
As the question has now been changed to "a gzip file that doesn't have empty contents", then:
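A possible sketch of that stricter test, again assuming Python 3 and using a hypothetical helper name, gz_has_content. It returns True only when at least one byte of uncompressed data can be read:

```python
import gzip
import zlib


def gz_has_content(path):
    """True only if `path` opens as gzip and yields at least one byte."""
    try:
        with gzip.open(path, 'rb') as f:
            # read(1) returns b'' for empty contents and for a 0-byte file.
            return len(f.read(1)) > 0
    except (OSError, EOFError, zlib.error):
        # Not gzip, truncated, corrupt, or unreadable (including a missing
        # file, since FileNotFoundError is a subclass of OSError).
        return False
```

Only the first byte is decompressed, so the cost does not grow with file size.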
Try something like this:
UPDATE:
I would strongly recommend upgrading to pandas 0.18.1 (currently the latest version), as each new version of pandas introduces nice new features and fixes many old bugs. The current version (0.18.1) will process your empty files out of the box (see demo below).
If you can't upgrade to a newer version, then follow @MartijnPieters' recommendation: catch the exception instead of checking beforehand (the "easier to ask for forgiveness than permission" paradigm).
OLD answer: a small demonstration (using pandas 0.18.1), which tolerates empty files, different number of columns, etc.
I tried to reproduce your error (trying an empty CSV.gz, different numbers of columns, etc.), but I didn't manage to reproduce your exception using pandas v. 0.18.1:
Output:
Can you post a sample CSV that causes this error, or upload it somewhere and post a link here?
Looking through the source code for the Python 2.7 version of the gzip module, it seems to immediately return EOF, not only in the case where the uncompressed contents are zero bytes, but also in the case where the gzip file itself is zero bytes, which is arguably a bug.
However, for your particular use-case, we can do a little better, by also confirming the gzipped file is a valid CSV file.
This code...
...should correctly handle the following error cases...
This should do it without reading the file.