What's the most compact way to store diffs in

Posted 2020-07-11 04:50

I want to implement something similar to Wikimedia's revision history. What would be the best PHP functions, libraries, extensions, or algorithms to use?

I would like the diffs to be as compact as possible, but I'm happy to be restricted to only showing the difference between each revision and its sibling, and only being able to roll back one revision at a time.

In some cases only a few characters may change, whereas in other cases the whole string could change, so I'm keen to understand whether some techniques are better for small changes than for large ones, and if in some cases it's more efficient to simply store whole copies.

Backing the whole system with something like Git or SVN seems a bit extreme, and I don't really want to store files on disk.

3 answers
老娘就宠你
Reply #2 · 2020-07-11 05:38

I would implement it using diff to create the delta and patch to apply one or more edits in sequence, building the document at a known state. Of course, the more you do this, the clearer it becomes that you can offload the task to a version control tool. I have twice redesigned diff/patch systems to use SVN for this type of task.
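As a rough illustration of the diff-then-patch idea, here is a minimal Python sketch. It is not the Unix diff/patch tools themselves: it uses the standard library's `difflib.SequenceMatcher` as a stand-in, and the names `make_delta`, `apply_delta`, and `build_revision` are hypothetical helpers invented for this example. The same approach should port to PHP with any library that reports changed spans.

```python
from difflib import SequenceMatcher


def make_delta(old: str, new: str) -> list:
    """Return the edit operations that turn `old` into `new`.

    Unchanged runs are skipped, so only the changed spans are recorded.
    """
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Replace old[i1:i2] with this text (empty text means a deletion).
            ops.append((i1, i2, new[j1:j2]))
    return ops


def apply_delta(old: str, ops: list) -> str:
    """Rebuild the newer text from `old` and a delta from make_delta."""
    out, pos = [], 0
    for i1, i2, text in ops:
        out.append(old[pos:i1])  # copy the unchanged run before the edit
        out.append(text)         # emit the replacement text
        pos = i2                 # skip the replaced span in the old text
    out.append(old[pos:])        # copy whatever follows the last edit
    return "".join(out)


def build_revision(base: str, deltas: list) -> str:
    """Apply deltas in sequence to reach a known later state."""
    text = base
    for ops in deltas:
        text = apply_delta(text, ops)
    return text
```

Storing only the first revision plus the list of deltas is enough to rebuild any later revision, at the cost of replaying every delta in between.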

我只想做你的唯一
Reply #3 · 2020-07-11 05:53

You must ask yourself: which will the end user want to retrieve more often, whole revisions or the diffs between them? I would use the standard Unix diff to produce the diffs. Then, depending on the answer to that question, store either diffs or whole revisions in the database.

"Backing the whole system with something like Git or SVN seems a bit extreme"

Why? GitHub, as far as I remember, stores wikis that way ;)

贼婆χ
Reply #4 · 2020-07-11 05:56

It is much easier to store each record in its entirety than to store diffs between records. Then, if you want a diff of two revisions, you can generate one as needed using the PEAR Text_Diff library.
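Generating a diff on demand from two full revisions might look like the following Python sketch. It uses the standard library's `difflib.unified_diff` rather than Text_Diff, and the function name `diff_revisions` and the label parameters are assumptions made for illustration.

```python
import difflib


def diff_revisions(old: str, new: str,
                   old_label: str = "old", new_label: str = "new") -> str:
    """Return a unified diff between two full revisions, computed on demand."""
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),  # compare line by line
        new.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(lines)
```

Because nothing but the full texts is stored, any two revisions can be compared, not only neighbouring ones.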

I like to store all versions of the record in a single table and retrieve the most recent one with MAX(revision), a "current" boolean attribute, or similar. Others prefer to denormalize and have a mirror table that holds non-current revisions.
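The single-table layout could be sketched as follows, using Python's built-in `sqlite3` module with an in-memory database. The table name `page_revision`, its columns, and the helper functions are all hypothetical; a real schema would likely carry timestamps, authors, and so on.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with one table holding every revision."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE page_revision (
               page_id  INTEGER NOT NULL,
               revision INTEGER NOT NULL,
               body     TEXT    NOT NULL,
               PRIMARY KEY (page_id, revision)
           )"""
    )
    return conn


def save_revision(conn: sqlite3.Connection, page_id: int, body: str) -> int:
    """Store a full copy of the text as the next revision of the page."""
    row = conn.execute(
        "SELECT COALESCE(MAX(revision), 0) + 1 FROM page_revision WHERE page_id = ?",
        (page_id,),
    ).fetchone()
    conn.execute(
        "INSERT INTO page_revision (page_id, revision, body) VALUES (?, ?, ?)",
        (page_id, row[0], body),
    )
    conn.commit()
    return row[0]


def latest_revision(conn: sqlite3.Connection, page_id: int):
    """Return (revision, body) for the newest revision, or None if there is none."""
    return conn.execute(
        """SELECT revision, body FROM page_revision
           WHERE page_id = ?
             AND revision = (SELECT MAX(revision) FROM page_revision
                             WHERE page_id = ?)""",
        (page_id, page_id),
    ).fetchone()
```

Rolling back one revision at a time then amounts to reading the row with the next-lower revision number.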

If you store diffs instead, your schema and algorithms become much more complex. You then need to store at least one "full" revision and multiple "diff" versions, and reconstruct a full version from a set of diffs whenever you need one. (This is how SVN stores things. Git, by contrast, models each revision as a full snapshot rather than a diff, although its packfiles delta-compress objects behind the scenes.)
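To give a feel for the extra complexity, here is a small Python sketch of a store that keeps a full copy when most of the text changed and a delta otherwise. The delta format (shared prefix length, shared suffix length, replaced middle) is a deliberately simple assumption suited to small localized edits, not what SVN does, and the "less than half the text" threshold is arbitrary. `RevisionStore`, `make_delta`, and `apply_delta` are hypothetical names.

```python
import os


def make_delta(old: str, new: str) -> tuple:
    """Describe `new` as: shared prefix of old + new middle + shared suffix of old."""
    prefix = len(os.path.commonprefix([old, new]))
    # Cap the suffix so it never overlaps the prefix in either string.
    max_suffix = min(len(old), len(new)) - prefix
    suffix = 0
    while suffix < max_suffix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return (prefix, suffix, new[prefix:len(new) - suffix])


def apply_delta(old: str, delta: tuple) -> str:
    """Rebuild the newer text from the older text and a delta."""
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


class RevisionStore:
    """Keeps each revision as either a full copy or a delta against its predecessor."""

    def __init__(self):
        self.entries = []  # ("full", text) or ("delta", (prefix, suffix, middle))

    def append(self, text: str) -> None:
        if self.entries:
            previous = self.get(len(self.entries) - 1)
            delta = make_delta(previous, text)
            # Arbitrary rule: keep the delta only if it replaces under half the text.
            if len(delta[2]) < len(text) // 2:
                self.entries.append(("delta", delta))
                return
        self.entries.append(("full", text))

    def get(self, index: int) -> str:
        # Walk back to the nearest full copy, then replay the deltas forward.
        start = index
        while self.entries[start][0] != "full":
            start -= 1
        text = self.entries[start][1]
        for _kind, delta in self.entries[start + 1:index + 1]:
            text = apply_delta(text, delta)
        return text
```

Even this toy version needs a reconstruction loop and a policy for when to fall back to a full copy, which is the added complexity described above.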

Programmer time is expensive, but disk space is usually cheap. Please consider whether storing each revision in full is really a problem.
