As far as I know, all distributed revision control systems require you to clone the whole repository. For this reason it is not wise to put huge amounts of content into one single repository (thanks for this answer). I know that this is not a bug but a feature, but I wonder whether this is a requirement for all distributed revision control systems.
In distributed rcs the history of a file (or a chunk of content) is a directed acyclic graph, so why can't you just clone this single DAG instead of the set of all graphs in the repository? Maybe I'm missing something, but the following use cases are hard to do:
- clone only a part of a repository
- merge two repositories (preserving their histories!)
- copy some files with their history from one repository to another
If I reuse parts of other people's code from multiple projects, I cannot preserve their full history. At least in git I can think of a (rather complex) workaround, sketched below:
- clone a full repository
- delete all content that I am not interested in
- rewrite the history to delete everything that is not in the master
- merge the remaining repository into an existing repository
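A minimal sketch of that workaround in git (the directory lib/useful and the URLs are placeholders; on recent Git the final merge needs --allow-unrelated-histories):

    # 1. clone the full repository
    git clone https://example.com/other-project.git
    cd other-project

    # 2-3. rewrite history so only the directory of interest survives;
    # --subdirectory-filter also makes that directory the new root
    git filter-branch --prune-empty --subdirectory-filter lib/useful master

    # 4. merge the rewritten history into an existing repository
    cd ../my-project
    git remote add other ../other-project
    git fetch other
    git merge --allow-unrelated-histories other/master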
I don't know if this is also possible with Mercurial or Bazaar, but it is certainly not easy. So is there any distributed rcs that supports partial checkout/clone by design? It should support one simple command to get a single file with its history from one repository and merge it into another. This way you would not need to think about how to structure your content into repositories and submodules; you could happily split and merge repositories as needed (the extreme would be one repository for each single file).
As of version 2.0, it is not possible to make a so-called "narrow clone" with Mercurial, that is, a clone where you only retrieve data for a specific sub-directory. We call it a "shallow clone" when you only retrieve part of the history, say, the last 100 revisions.
As you say, there is nothing in the common DAG-based history model that excludes this feature and we have been working on it. Peter Arrenbrecht, a Mercurial contributor, has implemented two different approaches for narrow clones, but neither approach has been merged yet.
Btw, you can of course split an existing Mercurial repository into pieces where each smaller repository only has the history for a specific sub-directory of the original repository. The convert extension is the tool for this. Each of the smaller repositories will be unrelated to the bigger repository, though — the tricky part is to make the splitting seamless so that the changesets keep their identities.
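A small sketch of such a split, assuming a sub-directory named lib (the filemap syntax is documented in the convert extension's help):

    # filemap.txt -- keep only "lib" and make it the new repository root:
    #     include lib
    #     rename lib .
    hg convert --filemap filemap.txt big-repo lib-repo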
There's a subtree module for git, allowing you to split off a portion of a repository into a new repo and then merge changes to/from the original and the subtree. Here's its readme on github: http://github.com/apenwarr/git-subtree/blob/master/git-subtree.txt
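A hedged sketch of how it can be used once installed as git-subtree (the prefix lib and the repository paths are placeholders):

    # split the history of lib/ into its own branch
    git subtree split --prefix=lib -b lib-only
    # publish that branch (assuming ../lib-repo.git is an existing bare repo)
    git push ../lib-repo.git lib-only:master
    # later, pull upstream changes for the subtree back in
    git subtree pull --prefix=lib ../lib-repo.git master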
In distributed rcs the history of a file (or a chunk of content) is a directed acyclic graph, so why can't you just clone this single DAG instead of the set of all graphs in the repository?
At least in Git, the DAG representing the repository history applies to the whole repository, not just a single file. Each commit object points to a "tree" object which represents the entire state of the repository at that time.
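You can see this structure with git cat-file in any Git repository (the output will of course differ per repository):

    # a commit object references its parent(s) and exactly one root tree
    git cat-file -p HEAD
    # that tree lists the entire top level of the repository
    git cat-file -p 'HEAD^{tree}'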
Git 1.7 supports "sparse checkouts", which allow you to restrict the size of your working copy. The entire repository data is still cloned, however.
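The usual recipe for enabling a sparse checkout in an existing clone looks like this (the path lib/ is a placeholder):

    git config core.sparseCheckout true
    echo "lib/" >> .git/info/sparse-checkout
    # re-read the index so the working copy contains only the listed paths
    git read-tree -mu HEAD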
In Bazaar you can split and join parts of a repository.
The split command allows you to split a repository into multiple repositories. The join command allows you to merge repositories. Both keep the history.
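A small sketch, assuming a sub-directory named lib inside a Bazaar working tree (split may require a rich-root repository format):

    # turn lib/ into an independent tree, preserving its history
    bzr split lib
    # later, merge that subtree back into the containing tree
    bzr join lib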
However, this isn't as handy as the SVN model, where you can check out and commit just a sub-tree.
There's a planned feature called Nested-Trees for Bazaar, which might allow partial checkouts.
I hope one of these RCSs will add narrow clone capability. My understanding is that the architecture of Git (changes/moves tracked across the whole repo) makes this very difficult.
Bazaar prides itself on supporting many different types of workflows. The lack of narrow clone capability prohibits an SVN/CVS-like workflow in bzr/hg/git, so I'm hoping they'll be motivated to find some way to do this.
New features shouldn't come at the expense of basic functionality, like the ability to fetch a single file/directory from the repo. The "distributed" feature of modern RCSs is "cool," but in my opinion it discourages good development practices (frequent merges / continuous integration). These new RCSs all seem to lack very basic functionality. Even SVN, without real branching/tagging support, seemed like a step backwards from CVS, imo.
As of Git 2.17 (Q2 2018, 10 years later), it will be possible to do what Mercurial planned to implement: a "narrow clone", that is, a clone where you only retrieve data for a specific sub-directory.
This is also called "partial clone".
That differs from the current alternatives:
- a shallow clone
- copying what you need from the cloned repo into another working folder
See commit 3aa6694, commit aa57b87, commit 35a7ae9, commit 1e1e39b, commit acb0c57, commit bc2d0c3, commit 640d8b7, commit 10ac85c (08 Dec 2017) by Jeff Hostetler (jeffhostetler).
See commit a1c6d7c, commit c0c578b, commit 548719f, commit a174334, commit 0b6069f (08 Dec 2017) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 6bed209, 13 Feb 2018)
Here are the tests for a partial clone:
git clone --no-checkout --filter=blob:none "file://$(pwd)/srv.bare" pc1
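Other filter specifications follow the same pattern; for instance (the URL is a placeholder):

    # omit all blobs; they are fetched lazily when actually needed
    git clone --filter=blob:none https://example.com/repo.git
    # omit only blobs larger than 1 MiB
    git clone --filter=blob:limit=1m https://example.com/repo.git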
There are other commits involved in that implementation of a narrow/partial clone.
In particular, commit 8b4c010:
sha1_file: support lazily fetching missing objects

Teach sha1_file to fetch objects from the remote configured in extensions.partialclone whenever an object is requested but missing.
Warning regarding Git 2.17/2.18: the recent addition of the "partial clone" experimental feature kicked in when it shouldn't, namely when there is no partial-clone filter defined even if extensions.partialclone is set.
See commit cac1137 (11 Jun 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 92e1bbc, 28 Jun 2018)
upload-pack: disable object filtering when disabled by config

When upload-pack gained partial clone support (v2.17.0-rc0~132^2~12, 2017-12-08), it was guarded by the uploadpack.allowFilter config item to allow server operators to control when they start supporting it.

That config item didn't go far enough, though: it controls whether the 'filter' capability is advertised, but if a (custom) client ignores the capability advertisement and passes a filter specification anyway, the server would handle that despite allowFilter being false.

This is particularly significant if a security bug is discovered in this new experimental partial clone code. Installations without uploadpack.allowFilter ought not to be affected since they don't intend to support partial clone, but they would be swept up into being vulnerable.
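On the server side, opting in is explicit; a minimal sketch:

    # allow clients to request filtered (partial) fetches
    git config uploadpack.allowFilter true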
This is enhanced with Git 2.20 (Q4 2018), since "git fetch $repo $object" in a partial clone did not correctly fetch the asked-for object that is referenced by an object in a promisor packfile, which has been fixed.
See commit 35f9e3e, commit 4937291 (21 Sep 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit a1e9dff, 19 Oct 2018)
fetch: in partial clone, check presence of targets

When fetching an object that is known as a promisor object to the local repository, the connectivity check in quickfetch() in builtin/fetch.c succeeds, causing object transfer to be bypassed. However, this should not happen if that object is merely promised and not actually present.

Because this happens, when a user invokes "git fetch origin <sha-1>" on the command line, the <sha-1> object may not actually be fetched even though the command returns an exit code of 0. This is a similar issue (but with a different cause) to the one fixed by a0c9016 ("upload-pack: send refs' objects despite "filter"", 2018-07-09, Git v2.19.0-rc0).

Therefore, update quickfetch() to also directly check for the presence of all objects to be fetched.
From git help clone:
--depth <depth>
    Create a shallow clone with a history truncated to the specified number of revisions. A shallow repository has a number of limitations (you cannot clone or fetch from it, nor push from nor into it), but is adequate if you are only interested in the recent history of a large project with a long history, and would want to send in fixes as patches.
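For example (the URL is a placeholder):

    # fetch only the 50 most recent revisions
    git clone --depth 50 https://example.com/repo.git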
Does that provide something like what you're looking for?