Consider this simple Python code, which demonstrates a very simple version-control design for a dictionary:
def build_current(history):
    """Replay the whole history, in order, to rebuild the current dictionary."""
    current = {}
    for action, key, value in history:
        assert action in ('set', 'del')
        if action == 'set':
            current[key] = value
        elif action == 'del':
            del current[key]
    return current

history = []
history.append(('set', '1', 'one'))
history.append(('set', '2', 'two'))
history.append(('set', '3', 'three'))
print(build_current(history))

history.append(('del', '2', None))
history.append(('set', '1', 'uno'))
history.append(('set', '4', 'four'))
print(build_current(history))

# Every action that ever touched key '2' is still in the history.
for action, key, value in history:
    if key == '2':
        print('(%s, %s, %s)' % (action, key, value))
Notice that by using the history list you can reconstruct the dictionary in any state it ever existed in. I consider this a "forward build" (for lack of a better term), because to build the current dictionary you must start at the beginning and process the entire history list. I consider this the most obvious and straightforward approach.
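For instance, continuing the snippet above, truncating the history list gives back any earlier state:

# Rebuild the dictionary as it was after the first three actions,
# i.e. before '2' was deleted and '1' was changed to 'uno'.
print(build_current(history[:3]))   # {'1': 'one', '2': 'two', '3': 'three'}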
From what I've heard, early version control systems used this "forward build" process, but it was not optimal because most users care more about recent revisions than about old ones. Also, users don't want to download the entire history when all they want to see is the latest version.
My question, then, is: what other approaches exist for storing histories in a version control system? Perhaps a "backwards build" could be used? That might allow users to download only recent revisions without needing the entire history. I've also seen a few different formats for storing history, namely changesets, snapshots, and patches. What are the differences between changesets, snapshots, and patches?
Of the popular modern version control systems available, how do they store their histories, and what are the advantages of their various designs?
You mentioned these three methods of storing (file) history: changesets, snapshots, and patches.
Subversion uses forward deltas (aka patches) in FSFS repositories and backward deltas in BDB repositories. Note that these implementations have different performance characteristics:
forward deltas are fast on commit but slow on checkout, because the "current" version must be reconstructed by replaying the whole chain
backward deltas are fast on checkout but slow on commit, because the new full text must be stored and the previous "current" rewritten as a delta against it (a sketch of both layouts follows below)
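Here is a minimal sketch of the two layouts, using toy dict-based "versions". This is nothing like SVN's real on-disk format; it only shows where each approach pays its cost:

def diff(old, new):
    """Delta that turns `old` into `new`: keys to set plus keys to delete."""
    sets = {k: v for k, v in new.items() if k not in old or old[k] != v}
    dels = [k for k in old if k not in new]
    return (sets, dels)

def apply_delta(version, delta):
    sets, dels = delta
    result = dict(version)
    result.update(sets)
    for k in dels:
        del result[k]
    return result

class ForwardDeltaStore:
    """Commit is cheap (append one delta); checking out the tip is expensive
    (replay every delta from the very first revision)."""
    def __init__(self):
        self.deltas = []   # deltas[i] turns revision i-1 (or {}) into revision i
        self._tip = {}     # cached only so the next commit can compute its delta
    def commit(self, new_version):
        self.deltas.append(diff(self._tip, new_version))
        self._tip = dict(new_version)
    def checkout(self, rev):
        version = {}
        for delta in self.deltas[:rev + 1]:
            version = apply_delta(version, delta)
        return version

class BackwardDeltaStore:
    """Checking out the tip is cheap (it is stored in full); commit is more work
    (the previous tip has to be rewritten as a reverse delta)."""
    def __init__(self):
        self.tip = {}             # full text of the newest revision
        self.reverse_deltas = []  # reverse_deltas[i] turns revision i back into i-1
    def commit(self, new_version):
        self.reverse_deltas.append(diff(new_version, self.tip))
        self.tip = dict(new_version)
    def checkout(self, rev):
        version = dict(self.tip)
        for delta in reversed(self.reverse_deltas[rev + 1:]):
            version = apply_delta(version, delta)
        return version

With the forward store, checking out the latest revision walks the whole list; with the backward store it is just a copy of the stored tip, but every commit first had to rewrite the old tip as a reverse delta.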
Also note that FSFS uses a "skip delta" scheme which minimizes the number of jumps needed to reconstruct a specific version. This skip-delta scheme, however, is not size-optimized the way Mercurial's snapshots are; it only minimizes the number of revisions you need to read to build a full version, regardless of their overall size.
Here is a small piece of ASCII art (copied from the specification) for a file with 9 revisions:

0 <- 1    2 <- 3    4 <- 5    6 <- 7
0 <------ 2         4 <------ 6
0 <---------------- 4
0 <------------------------------------ 8

where "0 <- 1" means that the delta base for revision 1 is revision 0.
The number of jumps is at most roughly log2(N) for N revisions.
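If I read the scheme correctly, the delta base for change number n is simply n with its lowest set bit cleared. A tiny sketch of that rule (the function names are mine, not FSFS's, and details may differ from the real implementation):

def skip_delta_base(n):
    """Delta base for change number n: n with its lowest set bit cleared."""
    return n & (n - 1)

def delta_chain(n):
    """Revisions that must be read to reconstruct change number n."""
    chain = [n]
    while n > 0:
        n = skip_delta_base(n)
        chain.append(n)
    return chain

for rev in range(9):
    print(rev, delta_chain(rev))
# delta_chain(7) == [7, 6, 4, 0]: the chain length is the number of set bits in n,
# so it is bounded by log2(n) + 1.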
Another very nice property of FSFS is that older revisions are written only once and afterwards are only ever read. That is why Subversion repositories are very stable: as long as you don't have a hardware failure on your hard disk, you should be able to get a working repository even if some corruption occurred during the last commit, because you still have all the older revisions.
With the BDB backend, the current revision is constantly rewritten on check-ins/commits, which makes the process prone to data corruption. Also, since the full text is stored only for the current revision, corrupting the data during a commit will likely destroy large parts of your history.
I think Subversion made some attempts at a backwards build, but let me explain what I know better: Mercurial's snapshots.
Mercurial uses a forward-build scheme, but to keep every revision cheap to rebuild there are resync points: whenever the total size of the deltas needed to rebuild a revision grows larger than twice the full text, a full-text version (a compressed snapshot) is stored, and all subsequent deltas are computed relative to this new snapshot.
This means you never need to read more than about three times the size of the text to retrieve any revision (one snapshot plus at most twice its size in deltas).
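A toy version of that resync rule might look like this. This is just the rule as described above; the real revlog also compresses deltas, tracks offsets, handles branching, and so on:

import zlib

class ToyRevlog:
    def __init__(self):
        self.entries = []          # ('snapshot', bytes) or ('delta', bytes)
        self._since_snapshot = 0   # size of deltas stored since the last snapshot

    def add_revision(self, full_text, delta):
        # full_text and delta are bytes; computing the delta against the
        # previous revision is left to the caller here.
        # If replaying the accumulated deltas would cost more than twice the
        # full text, store a fresh compressed snapshot and reset the chain.
        if self._since_snapshot + len(delta) > 2 * len(full_text):
            self.entries.append(('snapshot', zlib.compress(full_text)))
            self._since_snapshot = 0
        else:
            self.entries.append(('delta', delta))
            self._since_snapshot += len(delta)

Reading any revision then touches one snapshot plus at most twice its size in deltas, which is where the "at most 3x" bound above comes from.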
You can find more details in the Hg Book.
As a more generic answer, you need to differentiate CVCS (Centralized VCS, like CVS, SVN, Perforce, ClearCase, ...) from DVCS (Distributed VCS, like Git or Mercurial).
They involve different workflows and usage patterns.
In particular, the exchange of data between a CVCS client and its server matters much more than it does with a DVCS (which only really needs deltas when pushing or pulling a whole repo).
That is why deltas are very important for most operations in a CVCS, and only important for certain operations, and for different reasons, in a DVCS.
Deltas are described in Eric Sink's two books, Source Control HOWTO and Version Control by Example.
The direction of the delta matters (see this thread): forward delta or reverse delta.
You can see the evolution of the "old" VCSs in Eric Raymond's Understanding Version-Control Systems.
Don't forget that delta management also has an influence on the Directed Acyclic Graphs (DAGs) created to represent the history (see "Arrows direction in ProGit book" and the drawbacks behind DAGs).
You can find specifics about delta management for:
A tree DAG keeps the version history of a directory structure from a filesystem. Each node of the DAG represents one version of the whole tree.
A database (or “db”) DAG keeps the version history of a database, or a list of records. Each node of the DAG represents one state of the complete database.
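To make that concrete, here is a minimal sketch (hypothetical names, not any particular VCS's data model) of a DAG node whose payload is a complete tree or database state and which links to its parent nodes:

from dataclasses import dataclass, field

@dataclass
class DagNode:
    # The payload is a complete state: a whole directory tree for a "tree DAG",
    # or a full set of records for a "db DAG".
    state: dict
    parents: list = field(default_factory=list)   # more than one parent on a merge

# Hypothetical example: three states of a tree, plus a merge node.
root  = DagNode({'README': 'v1'})
left  = DagNode({'README': 'v2'}, parents=[root])
right = DagNode({'README': 'v1', 'LICENSE': 'MIT'}, parents=[root])
merge = DagNode({'README': 'v2', 'LICENSE': 'MIT'}, parents=[left, right])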
That last point illustrates that the third (fourth?) generation of VCSs must deal with the distribution not only of files (trees) but also of databases (for various purposes).