Repeatably gzip files in python

2019-08-15 08:05发布

I'm writing a script in python for deploying static sites to aws (s3, cloudfront, route53). Because I don't want to upload every file on every deploy, I check which files were modified by comparing their md5 hash with their e-tag (which s3 sets to be the object's md5 hash). This works well for all files except for those that my build script gzips before uploading. Taking a look inside the files, it seems like gzip isn't really a pure function; there are very slight differences in the output file every time gzip is run, even if the source file hasn't changed.

My question is this: is there any way to get gzip to reliably and repeatably output the exact same file given the exact same input? Or am I better off just checking if the file is gzipped, unzipping it and computing the md5 hash/manually setting the e-tag value for it instead?

标签： python gzip

2条回答

【Aperson】

2楼-- · 2019-08-15 08:50

gzip is not stable as you figured out correctly:

[root@dev1 ~]# touch a b
[root@dev1 ~]# gzip a
[root@dev1 ~]# gzip b
[root@dev1 ~]# md5sum a.gz b.gz
8674e28eab49306b519ec7cd30128a5c  a.gz
4974585cf2e85113f1464dc9ea45c793  b.gz

0人赞添加讨论(0) 举报

够拽才男人

3楼-- · 2019-08-15 09:09

The compressed data is the same each time. The only thing that differs is likely the modification time in the header. The fifth argument of GzipFile (if that's what you're using) allows you to specify the modification time in the header. The first argument is the file name, which also goes in the header, so you want to keep that the same. If you provide a fourth argument for the source data, then the first argument is used only to populate the file name portion of the header.

0人赞添加讨论(0) 举报

Repeatably gzip files in python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间