Uncompressing zipped data files before committing

Posted 2019-04-29 05:54

Does it make any sense to somehow store an "uncompressed" version of normally-compressed files in the repository?

If so, is there a standard way to implement this? I can think of two approaches:

  • A standard pre-commit hook that uncompresses each such file into a specially named folder, plus a post-checkout hook that compresses those folders back into the compressed files that LibreOffice knows how to read and write (something like the process described in "Should I decompress zips before I archive?").
  • Hacking the code of the version control software so that it automatically decompresses the old version and the new version and stores the diff between the decompressed files. If that fails or doesn't offer a significant improvement, it would fall back on the original behavior of storing the direct diff between the original files, or simply storing the file whole.

I have a collection of OpenOffice / LibreOffice files that are frequently edited. I am storing them in a version-control repository, as recommended by "Should images be stored in a git repository?", although I happen to be using TortoiseHg or SourceTree to access my repositories, rather than git.

I happen to know that OpenOffice files are actually zip-compressed containers with a few XML files inside. (I hear that many other popular applications' "binary" file formats are also some form of zip-compressed file.)
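A quick way to check the "zip container with XML inside" claim is to list the entries with Python's standard zipfile module. This is only a sketch: it builds a tiny ODF-like archive in memory as a stand-in, and `make_sample_document` and `list_members` are names made up for illustration.

```python
import io
import zipfile


def make_sample_document() -> bytes:
    """Build a tiny ODF-like archive in memory (hypothetical stand-in for a real .odt)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The media type entry is written first and left uncompressed.
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("content.xml", "<text>hello hello hello hello</text>" * 20,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_members(data: bytes) -> list[tuple[str, int, int]]:
    """Return (name, compressed size, uncompressed size) for each zip entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [(i.filename, i.compress_size, i.file_size) for i in zf.infolist()]


for name, packed, raw in list_members(make_sample_document()):
    print(f"{name}: {packed} bytes packed, {raw} bytes raw")
```

To inspect a real document, the bytes read from the file on disk can be passed to `list_members` instead of the sample.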

My understanding is that even the smallest change to such a "binary" file leads to the entire new file being stored in the repository, whereas a small change to a "text" file leads to only the change being stored and transmitted.

In theory, storing the uncompressed contents would have these advantages:

  • Where the change is only a few words, I could see the exact words that changed in the "diff" view in the change log, rather than the uninformative "binary file changed" message.
  • When several different people independently edit version 14 of a file, it's much easier to merge all of their improvements into version 16 of the file without regression.
  • Faster synchronization with the remote repository: only short "changes" need to be transmitted, rather than the entire (compressed) file.
  • Possibly a smaller repository, in terms of disk space: after a few hundred changes, I expect a relatively small repository that only contains a few hundred small changes, rather than a relatively large repository that contains a few hundred complete copies of these files. (I list this advantage last, because it is nearly irrelevant in these days of cheap disk space.)
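To make the first advantage concrete, here is a minimal sketch of what a word-level view could look like once the XML is reachable: it pulls content.xml out of two versions and runs a unified diff over it. The sample documents are built in memory, the helper names are invented for this example, and splitting on `><` is a crude assumption to get line-based output from XML that is often written on one long line.

```python
import difflib
import io
import zipfile


def make_document(body: str) -> bytes:
    """Build a tiny zip container in memory (hypothetical stand-in for a real .odt)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", f"<doc>{body}</doc>")
    return buf.getvalue()


def content_lines(data: bytes, member: str = "content.xml") -> list[str]:
    """Extract one XML member and split it so that each element starts a new line."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        xml = zf.read(member).decode("utf-8")
    return xml.replace("><", ">\n<").splitlines()


def diff_documents(old: bytes, new: bytes) -> list[str]:
    """Unified diff of the XML inside two versions of a document."""
    return list(difflib.unified_diff(
        content_lines(old), content_lines(new),
        "old/content.xml", "new/content.xml", lineterm="",
    ))


old = make_document("<p>The quick brown fox</p><p>stays</p>")
new = make_document("<p>The quick red fox</p><p>stays</p>")
print("\n".join(diff_documents(old, new)))
```

Only the paragraph that changed shows up as a removed and an added line; the untouched paragraph appears as context.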

1 answer
Root(大扎)
Post #2 · 2019-04-29 06:22

Does it make any sense to somehow store an "uncompressed" version of normally-compressed files in the repository?

It makes sense, especially if you need branching and diffing.

This old thread summarizes the situation.

  1. For OpenOffice documents whose size is dominated by embedded images and other large objects, the git delta mechanism already performs reasonably well, since OO files are Zip archives where each file is compressed separately.
    If you do not change an image, then that image remains stored in the same way and the delta can be done.
  2. For OO documents whose size is dominated by plain content, the git delta mechanism cannot work, since the zip compression introduces "mixing" and a small change in the document is converted into a very large change in the zip file.
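The "mixing" effect in point 2 can be illustrated with a rough experiment, assuming zlib's deflate as a stand-in for the compression used inside the archive. The text is made up, and `shared_fraction` is a crude measure invented for this sketch (common prefix plus common suffix); it is not how git computes deltas.

```python
import zlib


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of the longer input covered by the common prefix plus common suffix."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return (prefix + suffix) / max(len(a), len(b), 1)


# Made-up document body; any text of a few kilobytes would do.
old_text = " ".join(f"word{(i * 7919) % 1000}" for i in range(2000)).encode()
new_text = old_text.replace(b"word0 ", b"WORD0 ", 1)  # edit one word near the start

raw_share = shared_fraction(old_text, new_text)
packed_share = shared_fraction(zlib.compress(old_text), zlib.compress(new_text))
print(f"raw bytes shared:        {raw_share:.1%}")
print(f"compressed bytes shared: {packed_share:.1%}")
```

The uncompressed versions stay almost byte-for-byte identical after the edit, while the compressed streams share far less, which is why a delta over the compressed bytes has little to work with.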

It could be possible to write a clean filter that uncompresses the file before commit.
However, there is a catch with the complementary smudge filter used at checkout: if you do not smudge properly, git always shows the file as changed with respect to the index.
Smudging correctly would mean using the very same compression ratio and compression method that OO uses, which can be a little tricky. I have tried using the zip binary in both the clean and the smudge phases, and it does not work nicely: the smudged file is always different from the original one.
One should probably work at a lower level (libzip) to have finer control over what is happening, and prepend to the uncompressed file the compression parameters to be restored on smudging.
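For comparison, here is a minimal sketch of a different approach from the one tried above: instead of reproducing OO's compression on smudge, the clean step rewrites the archive with every member stored uncompressed and with fixed metadata, so that the same content always yields the same bytes. It assumes the smudge side is left as a pass-through and that the application can open stored archives, which would need to be verified. The function and constant names are hypothetical.

```python
import io
import zipfile

FIXED_TIME = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the zip format allows


def clean(data: bytes) -> bytes:
    """Rewrite a zip archive with every member stored uncompressed, in the same order."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_STORED) as dst:
        for member in src.infolist():
            # Fresh metadata per entry, so the output depends only on names and content.
            info = zipfile.ZipInfo(member.filename, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_STORED
            info.create_system = 3  # fixed value, so the result is the same on every platform
            dst.writestr(info, src.read(member))
    return out.getvalue()


def make_archive(compression: int) -> bytes:
    """Build a small sample archive in memory (hypothetical stand-in for an OO file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression) as zf:
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        zf.writestr("content.xml", "<p>some paragraph text</p>" * 50)
    return buf.getvalue()


deflated = make_archive(zipfile.ZIP_DEFLATED)
stored = make_archive(zipfile.ZIP_STORED)
print("inputs identical: ", deflated == stored)
print("cleaned identical:", clean(deflated) == clean(stored))
```

To use something like this as a git filter, the function would be wrapped in a script that reads the file from standard input and writes the result to standard output, then registered through git's `filter.<name>.clean` setting and a matching `.gitattributes` entry. Whether this is fast enough for large files is a separate question.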

The bigger issue, however, is that the clean/smudge approach can be really slow when dealing with large OO files.
