-->

Large Files in Source Control (TFS)

2019-02-16 10:04发布

问题:

Recently at the office we have been talking about placing large files into our TFS repository. The files themselves are XML, usually 100-200MB in size, and sometimes as large as 1GB. We use them as data for automated testing and they are mostly static (one gets a minor tweak every year or so). Anyway, there is a notion that putting files like this into the repository is a no-no because they are "big" and that will make things "slow" (outside of the original check-in/out) but we don't really have any evidence to back this up.

So my question is, what are the pros / cons / implications of putting large static files into a source code repository like TFS (or SVN, Git, etc. for that matter) Is it OK? Will it "fill up the server" or have some other dire consequence?

回答1:

tl;dr: TFS is designed to handle large files gracefully. The largest hurdle you'll have to face is network bandwidth to upload/download the files. The second issue is that of storage space on the server. Assuming you've considered these two issues, you shouldn't have any other problems.

Network bandwidth: There is very little overhead in checking in or getting files, it should be as fast as a typical HTTP upload or download. If your clients are remote from the server, network-wise, they may benefit by having a TFS source control proxy on their local network to speed up downloads.

Note that unlike some version control systems, TFS does not compute and transmit deltas when uploading or downloading new content. That is to say, if a client had revision 4 of a large text file, and revision 5 had added a few lines at the end, some version control tools optimize this experience to only send the changed lines. TFS does not do this optimization, so if your files change frequently, clients will need to download the entirety of the file each time.

Server storage: Disk space on the server is fairly straightforward - you'll need enough space to hold the files, there's little overhead beyond that. TFS will not slow down just because your repository contains large files.

If these files get modified frequently, you will need to account for the disk space used by the revisions, also. TFS stores "deltas" between file revisions - that is, a binary difference between two versions. So if the file's contents change minimally between revisions as in the typical use case with text files, the storage cost should be inexpensive. However, if the entirety of the contents change as would be typical with binary files like images or DLLs, then you'll need enough disk space to store each revision. (Of course, you can destroy previous revisions in order to regain that space.)

One note on deltas in TFS: to reduce overhead at check-in time, the deltas between revisions are not computed immediately, there's a background "deltafication" job that runs nightly to compute the deltas to trim space. Until that point, each revision is stored in its entirety in the database. So if you have a very large text file with a lot of revisions happening daily, your disk space requirements will need to take this into account.

Client storage: Clients will need to have enough disk space to contain these files also (although only at the revision that they've downloaded.) This can be mitigated in your workspace mappings such that the large files are cloaked (or otherwise not included in your workspace) if they're not needed.

Caveat: Getting Historic Versions: If you find yourself requesting historical versions of large files frequently (for example: I want an ISO image seven changesets ago), then you're going to make the server apply the delta chain to get back to that revision. If you have multiple clients doing this concurrently, this could tax your memory.



回答2:

If those files were constantly changing & their deltas were big, I would eventually expect a penalty in the overall TFS performance.

You clearly state that this is not the case, so, provided that your SQL server has the capacity to house the storage, I believe you should be able to proceed without any implications.

A minor downside you may experience, is when you 're constructing new workspaces, where you would have to pull those files from their repository. Unfortunately this does also happen during TFS Build, so it's possible that your builds will now take that much longer. The severity of this angle greatly depends on your network constellation/stability.



回答3:

The biggest problem (inconvenience) you'll have is having to download these massive files to all your workspaces, or map them out. Consider putting them into a separate team project to make this easier (unless you want to include them in branches, in which case I'd abuse keeping everything in one team project)

If you have control of the xml format then also consider a few tweaks to make them smaller. This will improve performance of store/get operations and also loading speed... Shorten element and attribute names, reduce the number of decimal places you are outputting for floating point numbers, etc. You will find threat simple schemes like this will knock many megabytes off the size of Gb-sized files, and it's easy to knock up a quick xslt transform or code to convert the files quickly over to the new format.