I've been looking like crazy for an explanation of a diff algorithm that works and is efficient.
The closest I got is this link to RFC 3284 (from several Eric Sink blog posts), which describes in perfectly understandable terms the data format in which the diff results are stored. However, it says nothing about how a program would arrive at these results while doing a diff.
I'm trying to research this out of personal curiosity, because I'm sure there must be tradeoffs when implementing a diff algorithm, which become pretty clear sometimes when you look at diffs and wonder "why did the diff program choose this as a change instead of that?"
Where can I find a description of an efficient algorithm that'd end up outputting VCDIFF?
By the way, if you happen to find a description of the actual algorithm used by SourceGear's DiffMerge, that'd be even better.
NOTE: longest common subsequence doesn't seem to be the algorithm used by VCDIFF, it looks like they're doing something smarter, given the data format they use.
Based on the link Emmelaich gave, there is also a great rundown of Diff Strategies on Neil Fraser's website (one of the authors of the library).
He covers basic strategies and, towards the end of the article, progresses to Myers' algorithm and some graph theory.
I would begin by looking at the actual source code for diff, which GNU makes available.
For an understanding of how that source code actually works, the docs in that package reference the papers that inspired it:
Reading the papers and then looking at the source code for an implementation should be more than enough to understand how it works.
I came here looking for the diff algorithm and afterwards made my own implementation. Sorry, I don't know about VCDIFF.
Wikipedia: From a longest common subsequence it's only a small step to get diff-like output: if an item is absent in the subsequence but present in the original, it must have been deleted. (The '–' marks, below.) If it is absent in the subsequence but present in the second sequence, it must have been added in. (The '+' marks.)
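To make the step from LCS to diff output concrete, here is a minimal Python sketch of the idea in the Wikipedia description. It assumes the plain quadratic dynamic-programming table for the LCS (slow but simple); the names `lcs_table` and `lcs_diff` are made up for this illustration and do not come from any particular diff tool.

```python
def lcs_table(a, b):
    """lengths[i][j] is the length of the LCS of a[i:] and b[j:]."""
    lengths = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                lengths[i][j] = lengths[i + 1][j + 1] + 1
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])
    return lengths


def lcs_diff(a, b):
    """Return a list of (mark, item) pairs: ' ' kept, '-' deleted, '+' added."""
    lengths = lcs_table(a, b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Item is part of the common subsequence.
            out.append((" ", a[i]))
            i += 1
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            # Dropping a[i] loses nothing from the LCS: it was deleted.
            out.append(("-", a[i]))
            i += 1
        else:
            # Otherwise b[j] is not in the LCS: it was added.
            out.append(("+", b[j]))
            j += 1
    # Whatever is left over on either side is a pure deletion or addition.
    out.extend(("-", item) for item in a[i:])
    out.extend(("+", item) for item in b[j:])
    return out


for mark, item in lcs_diff(["a", "b", "c"], ["a", "c", "d"]):
    print(mark, item)
```

The tie-break in the `elif` (prefer a deletion when both choices keep the same LCS length) is one of the arbitrary decisions that makes two correct diff programs produce different output for the same input.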
Nice animation of the LCS algorithm here.
Link to a fast LCS Ruby implementation here.
My slow and simple Ruby adaptation is below.
See http://code.google.com/p/google-diff-match-patch/
Also see the wikipedia.org Diff page and "Bram Cohen: The diff problem has been solved".
An O(ND) Difference Algorithm and its Variations is a fantastic paper and you may want to start there. It includes pseudo-code and a nice visualization of the graph traversals involved in doing the diff.
Section 4 of the paper introduces some refinements to the algorithm that make it very effective.
Successfully implementing this will leave you with a very useful tool in your toolbox (and probably some excellent experience as well).
Generating the output format you need can sometimes be tricky, but if you understand the algorithm's internals, you should be able to output anything you need. You can also introduce heuristics to affect the output and make certain tradeoffs.
Here is a page that includes a bit of documentation, full source code, and examples of a diff algorithm using the techniques of the aforementioned algorithm.
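As a rough illustration of the greedy forward search described in the paper, here is a Python sketch that computes only D, the length of the shortest edit script (insertions plus deletions). It follows the basic algorithm, not the Section 4 refinements, and the function name `shortest_edit_distance` is made up for this example. Recovering the actual edits would additionally require saving the frontier `v` for each `d` and backtracking through those copies.

```python
def shortest_edit_distance(a, b):
    """Length of the shortest insert/delete script turning a into b.

    v[k] holds the furthest x reached so far on diagonal k, where k = x - y.
    """
    n, m = len(a), len(b)
    v = {1: 0}  # sentinel so that the d = 0 step starts at x = 0
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Choose whether to arrive on diagonal k by moving down from
            # k + 1 (an insertion) or right from k - 1 (a deletion).
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]
            else:
                x = v[k - 1] + 1
            y = x - k
            # Follow the "snake": matching items cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    raise AssertionError("unreachable: d = n + m always suffices")


# The example pair used in the paper; its shortest edit script has 5 edits.
print(shortest_edit_distance("ABCABBA", "CBABAC"))
```

The work done is proportional to the sequence length times the number of differences, which is why this approach is fast when the two inputs are mostly the same.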
The source code appears to follow the basic algorithm closely and is easy to read.
There's also a bit on preparing the input, which you may find useful. The output differs enormously depending on whether you diff by character or by token (word).
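To see how much the input preparation matters, here is a small Python sketch that diffs the same pair of strings twice, once as characters and once as whitespace-separated words. It uses the standard library's `difflib.SequenceMatcher`, which is not the O(ND) algorithm from the paper (it is a different matching heuristic), so treat it only as a stand-in for "some diff engine"; the helper name `changes` is made up for this example.

```python
import difflib


def changes(a, b):
    """Return the non-equal opcodes as (tag, old_items, new_items)."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


old = "the quick brown fox"
new = "the quick red fox"

# Character-level: the strings themselves are the sequences.
by_char = changes(old, new)

# Word-level: tokenize first, then diff the token lists.
by_word = changes(old.split(), new.split())

print("by character:", by_char)
print("by word:     ", by_word)
```

At the word level the only change reported is `brown` replaced by `red`, while the character-level run breaks the same edit into smaller fragments because the two words happen to share a letter.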
Good luck!