Situation
I have two servers, Production and Development. The Production server hosts two applications and multiple (6) MySQL databases which I need to distribute to developers for testing. All source code is stored in GitLab on the Development server; developers work only with this server and don't have access to the Production server. When we release an application, the master logs into Production and pulls the new version from Git. The databases are large (over 500 MB each and growing) and I need to distribute them to developers for testing as easily as possible.
Possible solutions
After the backup script dumps the databases, each to a single file, a second script pushes each database to its own branch. A developer pulls one of these branches when he wants to update his local copy. This one was found not to work.
A cron job on the Production server saves the binary logs every day and pushes them into that database's branch. So the branch contains files with daily changes, and a developer pulls only the files he doesn't have yet. The current SQL dump is sent to the developer another way. When the repository becomes too large, we send a full dump to the developers, flush all data in the repository, and start over from the beginning.
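(For reference, both variants rely on the same nightly dump step; a minimal sketch, with database names and the output directory as placeholders for whatever the real backup script uses:)

```bash
#!/bin/sh
# Hypothetical nightly backup script on the Production server.
# Database names and the output directory are placeholders.
DATABASES="app1 app2 billing stats users logs"
OUTDIR=/var/backups/db

for DB in $DATABASES; do
    # --single-transaction takes a consistent InnoDB snapshot without locking tables
    mysqldump --single-transaction "$DB" > "$OUTDIR/$DB.sql"
done
```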
Questions
- Is the solution possible?
- If git is pushing/pulling to/from a repository, does it upload/download whole files, or just the changes in them (i.e. the added or edited lines)?
- Can Git manage such large files? No.
- How to set how many revisions are preserved in a repository? Doesn't matter with the new solution.
- Is there any better solution? I don't want to force the developers to download such large files over FTP or anything similar.
You really, really, really do not want large binary files checked into your Git repository.
Each update you add will cumulatively grow the overall size of your repository, meaning that down the road your Git repo will take longer and longer to clone and use up more and more disk space. Git stores the entire history of the branch locally, so when someone checks out the branch, they don't just have to download the latest version of the database; they also have to download every previous version.
If you need to provide large binary files, upload them to some server separately, and then check in a text file with a URL where the developer can download the large binary file. FTP is actually one of the better options, since it's specifically designed for transferring binary files, though HTTP is probably even more straightforward.
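As a rough illustration (file name, URL, and checksum here are all made up), the checked-in text file could even be a tiny script the developer runs to fetch and verify the current dump:

```bash
#!/bin/sh
# get-db.sh -- checked into the repo instead of the dump itself.
# The URL and checksum below are updated whenever a new dump is published.
DUMP_URL="https://files.example.com/dumps/app1-2013-07-01.sql.gz"
DUMP_SHA256="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

curl -fL -o app1.sql.gz "$DUMP_URL"
# Refuse a partial or corrupted download
echo "$DUMP_SHA256  app1.sql.gz" | sha256sum -c -
```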
rsync could be a good option for efficiently updating the developers' copies of the databases.
It uses a delta algorithm to incrementally update the files, so it only transfers the blocks of a file that have changed or are new. The developers will of course still need to download the full file the first time, but later updates would be quicker.
Essentially you get an incremental update similar to a git fetch, without the ever-expanding initial copy that a git clone would give. The loss is not having the history, but it sounds like you don't need that.
rsync is a standard part of most Linux distributions; if you need it on Windows, there is a packaged port available: http://itefix.no/cwrsync/
To push the databases to a developer you could use a command similar to:
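(A sketch with made-up host and path names; it assumes the dumps sit in /var/backups/db/ on Production and the developer machine is reachable over SSH as dev1.example.com.)

```bash
# Push all dumps to one developer's machine; rsync only transfers changed blocks
rsync -av --progress /var/backups/db/ dev1.example.com:/home/dev/db-dumps/
```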
Or the developers could pull the database(s) they need with:
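(Again with placeholder names; the developer pulls from whichever host the dumps are published to, e.g. the Development server acting as a mirror, since they cannot reach Production directly.)

```bash
# Pull just the one database dump this developer needs
rsync -av --progress dev-server.example.com:/srv/db-dumps/app1.sql ./db-dumps/
```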
Update 2017:
Microsoft is contributing to Microsoft/GVFS: a Git Virtual File System which allows Git to handle "the largest repo on the planet"
(i.e. the Windows code base, which is approximately 3.5M files and, when checked into a Git repo, results in a repo of about 300GB, and produces 1,760 daily “lab builds” across 440 branches, in addition to thousands of pull request validation builds)
Some parts of GVFS might be contributed upstream (to Git itself).
But in the meantime, all new Windows development is now (August 2017) on Git.
Update April 2015: GitHub proposes: Announcing Git Large File Storage (LFS)
Using git-lfs (see git-lfs.github.com) and a server supporting it (lfs-test-server), you can store only the metadata in the git repo, and the large files elsewhere.
See git-lfs/wiki/Tutorial:
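For example, a minimal sketch of that workflow (the `*.sql` pattern and the file name are assumptions for this question's setup, not part of the tutorial itself):

```bash
git lfs install                          # one-time setup per machine
git lfs track "*.sql"                    # store SQL dumps via LFS instead of plain git
git add .gitattributes production-db.sql
git commit -m "Track database dumps with Git LFS"
git push origin master                   # the dump goes to the LFS server, only a pointer goes into git
```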
Original answer:
Regarding git's limitations with large files, you can consider bup (presented in detail in GitMinutes #24).
The design of bup highlights the three issues that limit a git repo:
- handling huge files (and the memory cost of `xdelta` when diffing them)
- handling huge numbers of files (and a `git gc` that generates one packfile at a time)
- handling huge repositories (meaning huge numbers of huge packfiles)
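To see how those issues show up on your own repo, you can inspect its object and packfile sizes with standard git commands (nothing bup-specific here):

```bash
# Count loose objects and report packfile sizes (size-pack is in KiB)
git count-objects -v
# Repacking everything is the step that gets slow and memory-hungry
# once large binary blobs are in the history
git gc --aggressive
```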
Having an auxiliary storage of files referenced from your git-stashed code is where most people go. git-annex does look pretty comprehensive, but many shops just use an FTP or HTTP (or S3) repository for the large files, like SQL dumps. My suggestion would be to tie the code in the git repo to the names of the files in the auxiliary storage by stuffing some of the metadata - specifically a checksum (probably SHA) and a date - into the name.
Cramming huge files into git (or most repos) has a nasty impact on git's performance after a while - a `git clone` really shouldn't take twenty minutes, for example. Whereas using the files by reference means that some developers will never need to download the large chunks at all (a sharp contrast to the `git clone`), since the odds are that most are only relevant to the deployed code in production. Your mileage may vary, of course.
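A minimal sketch of that suggestion (host, paths, and file names are invented): the backup job names each dump after its checksum and date, uploads it to the auxiliary storage, and only a tiny reference file is committed:

```bash
DB=app1
DATE=$(date +%F)
DUMP="/var/backups/db/$DB.sql"            # produced by the existing backup script
SHA=$(sha256sum "$DUMP" | cut -d' ' -f1)

# The name in auxiliary storage carries the checksum and the date
scp "$DUMP" files.example.com:/srv/dumps/"$DB-$DATE-$SHA.sql"

# Commit only the small reference, never the dump itself
echo "$DB-$DATE-$SHA.sql" > "db-refs/$DB.ref"
git add "db-refs/$DB.ref"
git commit -m "Point $DB at the $DATE dump"
```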
You can look at a solution like git-annex, which is about managing (big) files with git, without checking the file contents into git(!)
(Feb 2015: a hosting service like GitLab integrates it natively: see "Does GitLab support large files via `git-annex` or otherwise?")
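A quick sketch of what that looks like in practice (remote setup and file names are placeholders; see the git-annex walkthrough for the real details):

```bash
git annex init "production dumps"     # turn the existing repo into an annex
git annex add app1.sql                # content goes into the annex, git only tracks a pointer
git commit -m "Add app1 dump via git-annex"
git annex sync --content              # sync both the git metadata and the file contents

# On the developer side, fetch only the dumps that are actually needed:
git annex get app1.sql
```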
or otherwise?")git doesn't manage big files, as explained by Amber in her answer.
That doesn't mean git won't be able to do better one day though.
From GitMinutes episode 9 (May 2013, see also below), Peff (Jeff King), at 36'10'':
(transcript)
Not a high-priority project for now...
Three years later, in April 2016, Git Minutes 40 includes an interview with Michael Haggerty from GitHub, around 31' (thank you Christian Couder for the interview).
He has specialized in the reference back-end for quite a while.
He cites David Turner's work on back-ends as the most interesting at the moment. (See David's current `pluggable-backends` branch of his git/git fork.) (transcript)
[follows other considerations around having faster packing, and reference patch advertisement]
Uploading large files sometimes creates issues and errors, and it happens often. GitHub, for example, starts warning about files larger than 50 MB. To upload bigger files (.mp4, .mp3, .psd, SQL dumps, etc.) to a git repository, you need to install an additional helper: Git LFS.
There are some basic git commands to know before pushing a big file. The configuration below is for GitHub; it requires installing the git-lfs client (git-lfs.exe on Windows), which you can get from git-lfs.github.com.
After that you use the usual git commands, along with a few LFS-specific ones.
You may find that you need the setting `lfs.https://github.com/something/repo.git/info/lfs.locksverify false`; git prints an instruction like that during `git push` if you push without it.
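That setting is applied with a one-line `git config` call (the URL is the placeholder from above; use your own repository's LFS endpoint):

```bash
# Disable LFS lock verification for a server that doesn't support the locking API
git config lfs.https://github.com/something/repo.git/info/lfs.locksverify false
```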