How are version control histories stored and calcu

2020-05-26 15:38发布

问题:

Consider this simple python code, which demonstrates a very simple version control design for a dictonary:

def build_current(history):
    current = {}
    for action, key, value in history:
        assert action in ('set', 'del')
        if action == 'set':
            current[key] = value
        elif action == 'del':
            del current[key]
    return current

history = []
history.append(('set', '1', 'one'))
history.append(('set', '2', 'two'))
history.append(('set', '3', 'three'))
print build_current(history)
history.append(('del', '2', None))
history.append(('set', '1', 'uno'))
history.append(('set', '4', 'four'))
print build_current(history)
for action, key, value in history:
    if key == '2':
        print '(%s, %s, %s)' % (action, key, value)

Notice that by using the history list you can reconstruct the current dictionary in any state it once existed. I consider this a "forward build" (for lack of a better term) because to build the current dictionary one must start at the beginning and process the entire history list. I consider this the most obvious and straight forward approach.

As I've heard, early version control systems used this "forward build" process, but they were not optimal because most users care more about recent versions of a build. Also, users don't want to download the entire history when they only care about seeing the latest build.

My question then is, what other approaches exist for storing histories in a version control system? Perhaps a "backwards build" could be used? That might allow users to only download recent revisions without needing the entire history. I've also seen a few different formats for storing the history, namely: changesets, snapshots, and patches. What are the differences between changesets, snapshots and patches?

Of the modern popular version controls available, how do they store their histories and what are the advantages of their various designs?

回答1:

You mentioned these 3 methods of storing (file)-history:

  1. patch : a patch is the (usually textual, but binary patches are also possible) representation of the difference between two files. It is the output of unix command diff and can be applied by unix command patch. A lot of versioning systems are using patches to store the history of files (eg. SVN, CVS, GIT..). Sometimes these patches are technically called "delta" as the greek letter "Δ" describing the difference of two things.
  2. changeset: a changeset is a term to combine changes "which belonging together" to different files in a single entity. Not all versioning systems support changesets (most notable CVS and SourceSafe). Developer are using changesets to avoid broken builds(example: change the signature of a method in one file, change the call in a second file. You need to have both changes in place to run the program, otherwise you get an error). See also here for the difference between changeset and patch.
  3. snapshots: are full copies of the state of this file/filesystem to this point of time. They are usually quite large and their usage depends on performance characteristics. The snapshot is always redundant to a list of patches, however to retrieve information faster, sometimes Versioning Systems mix or combine patches and snapshots

Subversion uses forward deltas(aka patches) in FSFS repositories and backward deltas in BDB Repositories. Note that these implementations have different performance characteristics:

  • forward deltas are fast in committing, but slow on checkouts(as the "current" version must be reconstructed)

  • backward deltas are fast in checking out but slow on commit as new deltas must be constructed to construct the new current and rewrite the previous "current" as a bunch of deltas

Also note that FSFS uses a "skipping delta" algorithm which minimizes the jumps to reconstruct a specific version. This skipping delta however is not size optimized as mercurials snapshots; it just minimizes the number of "revisions" you need to build a full version regardless of the overall size.

Here a small ascii art (copied from the specification) of a file with 9 revisions:

0 <- 1    2 <- 3    4 <- 5    6 <- 7
0 <------ 2         4 <------ 6
0 <---------------- 4
0 <------------------------------------ 8 <- 9

where "0 <- 1" means that the delta base for revision 1 is revision 0.

The number of jumps is at most log(N) for N revisions.

Also a very pretty effect on FSFS is that older revision will be written only once and after this they will be only read by further actions. That's why subversion repositories are very stable: as long as you do not have a HW failure on your harddisk, you should be able to get a working repository even if some corruption occurred in the last commit: You still have all older revisions.

In BDB Backend you have constantly rewrite the current revision on checkins/commits, which makes this process prone to data corruption. Also as you store the full text only in current revision, corrupting the data on commit will likely destroy great parts of your history.



回答2:

I think subversion made some attempts at backwards build. But I can explain what I know better: Mercurial snapshots.

Mercurial uses a forward build scheme. But in order for each revision to be easily rebuildable, there are resync points: every time the size of all the deltas needed to rebuild a revision is bigger than two times the full text, a full text version (a compressed snapshot) is stored and all subsequent delta are computed relative to this new snapshot.

This means that you never need to read more than 3 times the size of the text to retrieve any revision.

You can find more details in the Hg Book.



回答3:

As a more generic answer, you need to differentiate CVCS (Centralized VCS, like CVS, SVN, Perforce, ClearCase, ...) from DVCS (Distributed VCS, like Git or Mercurial).
They involves different workflows and usage.

In particular, the exchange of data between a CVCS client and its server will be more important than with a DVCS (which really needs delta when pushing or pulling the all repo)

That is why delta are very important for most operations in a CVCS, a only important for certain operations and for different reasons in a DVCS.

Deltas are described in Eric Sink two books:

  • Source Control HOWTO, chapter Chapter 4: Repositories:

Repository = File System * Time

A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees do not need to be related. However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other. Some developer started with tree N and made one or more changes, resulting in tree N+1.

We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this purpose. A changeset is merely a list of the changes which express the difference between two trees.

The delta sense is important (see this thread): forward delta or reverse delta.

Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way.

You can see the evolution of for the "old" VCS in Eric Raymond's Understanding Version-Control Systems.

  • Version Control by Example, Chapter 12. DVCS Internals:

Many modern version control tools use binary file deltas for repository storage.
One popular file delta algorithm is called vcdiff.
It outputs a list of byte ranges which have been changed. This means it can handle any kind of file, binary or text. As an ancillary benefit, the vcdiff algorithm compresses the data at the same time.

Don't forget that delta management also has an influence on the Directed Acyclic Graphs (DAGs) created for representing the history (see "Arrows direction in ProGit book" and the inconvenient behind DAG).

you can find specifics about delta management for:

  • Git: "Is the git binary diff algorithm (delta storage) standardized?"
  • Mercurial: "Mercurial: Repository Structure"
  • Veracity (which combines both approach): "Veracity: DAGs and Data"

Veracity supports two kinds of DAGs:

  • A tree DAG keeps the version history of a directory structure from a filesystem. Each node of the DAG represents one version of the whole tree.

  • A database (or “db”) DAG keeps the version history of a database, or a list of records. Each node of the DAG represents one state of the complete database.

That last points illustrates that the third (fourth?) generation of VCS must deal with distribution of not only the files (tree) but also databases (for various purposes)