Repeatably gzip files in python

2019-08-15 08:29发布

问题:

I'm writing a script in python for deploying static sites to aws (s3, cloudfront, route53). Because I don't want to upload every file on every deploy, I check which files were modified by comparing their md5 hash with their e-tag (which s3 sets to be the object's md5 hash). This works well for all files except for those that my build script gzips before uploading. Taking a look inside the files, it seems like gzip isn't really a pure function; there are very slight differences in the output file every time gzip is run, even if the source file hasn't changed.

My question is this: is there any way to get gzip to reliably and repeatably output the exact same file given the exact same input? Or am I better off just checking if the file is gzipped, unzipping it and computing the md5 hash/manually setting the e-tag value for it instead?

回答1:

The compressed data is the same each time. The only thing that differs is likely the modification time in the header. The fifth argument of GzipFile (if that's what you're using) allows you to specify the modification time in the header. The first argument is the file name, which also goes in the header, so you want to keep that the same. If you provide a fourth argument for the source data, then the first argument is used only to populate the file name portion of the header.



回答2:

gzip is not stable as you figured out correctly:

[root@dev1 ~]# touch a b
[root@dev1 ~]# gzip a
[root@dev1 ~]# gzip b
[root@dev1 ~]# md5sum a.gz b.gz
8674e28eab49306b519ec7cd30128a5c  a.gz
4974585cf2e85113f1464dc9ea45c793  b.gz


标签: python gzip