Fastest way to bulk upload files into Azure (C#)

Posted 2019-04-28 10:48

Question:

What is the fastest way to bulk upload files to Azure Blob Storage? I've tried two methods, sync and async uploads; async is obviously the faster of the two, but I'm wondering if there is a better method. Is there built-in support for batch uploads? I can't find anything in the documentation, but I might have missed it.

This is the test I ran:

using System;
using System.Configuration;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

static void Main(string[] args)
{
    int totalFiles = 10; // 10, 50, 100
    byte[] randomData = new byte[2097152]; // 2 MB
    for (int i = 0; i < randomData.Length; i++)
    {
        randomData[i] = 255;
    }

    CloudStorageAccount cloudStorageAccount = CloudStorageAccount.Parse(ConfigurationManager.AppSettings["StorageConnectionString"]);
    var blobClient = cloudStorageAccount.CreateCloudBlobClient();

    var container = blobClient.GetContainerReference("something");
    container.CreateIfNotExists();

    TimeSpan tsSync = Test1(totalFiles, randomData, container);
    TimeSpan tsAsync = Test2(totalFiles, randomData, container);

    Console.WriteLine($"Sync: {tsSync}");
    Console.WriteLine($"Async: {tsAsync}");

    Console.ReadLine();
}

public static TimeSpan Test2(int total, byte[] data, CloudBlobContainer container)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();

    // Kick off all uploads at once, then wait for every one to complete.
    Task[] tasks = new Task[total];
    for (int i = 0; i < total; i++)
    {
        CloudBlockBlob blob = container.GetBlockBlobReference(Guid.NewGuid().ToString());
        tasks[i] = blob.UploadFromByteArrayAsync(data, 0, data.Length);
    }
    Task.WaitAll(tasks);

    sw.Stop();
    return sw.Elapsed;
}

public static TimeSpan Test1(int total, byte[] data, CloudBlobContainer container)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();

    // Upload each blob one at a time on the calling thread.
    for (int i = 0; i < total; i++)
    {
        CloudBlockBlob blob = container.GetBlockBlobReference(Guid.NewGuid().ToString());
        blob.UploadFromByteArray(data, 0, data.Length);
    }

    sw.Stop();
    return sw.Elapsed;
}

The output from this is:

10 Files

Sync: 00:00:08.7251781
Async: 00:00:04.7553491
DMLib: 00:00:05.1961654

Sync: 00:00:08.1169861
Async: 00:00:05.2384105
DMLib: 00:00:05.4955403

Sync: 00:00:07.6122464
Async: 00:00:05.0495365
DMLib: 00:00:06.4714047

50 Files

Sync: 00:00:39.1595797
Async: 00:00:22.5757347
DMLib: 00:00:25.2897623

Sync: 00:00:40.4932800
Async: 00:00:22.3296490
DMLib: 00:00:26.0631829

Sync: 00:00:39.2879245
Async: 00:00:24.0746697
DMLib: 00:00:26.9243116

I hope this is a valid question for SO.

Thanks

EDIT:

I have updated the results above with "DMLib" tests in response to the answers given so far. The DMLib runs were done with no config changes (see above) and showed no performance gains.

I ran some more tests with ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8; as recommended by the documentation. This increased the upload speed by quite a bit, but it also increased the upload speed of my own async method, so DMLib still hasn't given me any performance increase worth switching for. I've added the second set of test results with this config change below.

I also set ServicePointManager.Expect100Continue = false; however, this made no difference to the speed.
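
For reference, both settings are statics on ServicePointManager (in System.Net) and need to be applied before the first request goes out; roughly like this (the placement at startup is my assumption):

// Set once at startup, before any storage requests are made.
ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;
ServicePointManager.Expect100Continue = false;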

Test results with ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;

10 Files

Sync: 00:00:07.6199307
Async: 00:00:02.9615565
DMLib: 00:00:02.6629716

Sync: 00:00:08.7721797
Async: 00:00:02.8246599
DMLib: 00:00:02.7281091

Sync: 00:00:07.8437682
Async: 00:00:03.0171246
DMLib: 00:00:03.0190045

50 Files

Sync: 00:00:40.2395863
Async: 00:00:10.3157544
DMLib: 00:00:10.5107740

Sync: 00:00:40.2473358
Async: 00:00:10.8190161
DMLib: 00:00:10.2585441

Sync: 00:00:41.2646137
Async: 00:00:13.7188085
DMLib: 00:00:10.8686173

Am I using the library incorrectly, as it does not seem to provide any better performance than my own method?

Answer 1:

Please use the Azure Storage Data Movement Library, which is the core of AzCopy. This library is exactly the tool to solve your problem. :)
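
Here is a minimal sketch of plugging DMLib into the question's test harness. The Test3 name is made up for symmetry with the question, and the namespaces are from the Microsoft.Azure.Storage.DataMovement NuGet package (older DMLib releases use Microsoft.WindowsAzure.Storage.* types instead):

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Storage.Blob;
using Microsoft.Azure.Storage.DataMovement;

public static async Task<TimeSpan> Test3(int total, byte[] data, CloudBlobContainer container)
{
    // DMLib splits each transfer into chunks and uploads the chunks in parallel.
    TransferManager.Configurations.ParallelOperations = Environment.ProcessorCount * 8;

    Stopwatch sw = Stopwatch.StartNew();

    Task[] tasks = new Task[total];
    for (int i = 0; i < total; i++)
    {
        CloudBlockBlob blob = container.GetBlockBlobReference(Guid.NewGuid().ToString());
        tasks[i] = TransferManager.UploadAsync(new MemoryStream(data), blob);
    }
    await Task.WhenAll(tasks);

    sw.Stop();
    return sw.Elapsed;
}

Note that DMLib's main advantage shows on large blobs, where chunked parallel transfer helps; for many small 2 MB blobs it does much the same work as the question's async loop, which is consistent with the numbers in the question.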



Answer 2:

Use AzCopy to accomplish the job. Unfortunately, it's a standalone exe.

You can also split files into blocks (read by start offset and length) and upload the blocks in parallel. It's a bit more complex, and you have to tune the number of upload threads to the machine; a sketch of the idea follows.
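
A rough sketch of that block-splitting approach, using the same WindowsAzure.Storage SDK as the question (the UploadInBlocksAsync name and the 1 MB block size are illustrative choices, not tuned values):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

public static async Task UploadInBlocksAsync(CloudBlockBlob blob, byte[] data, int blockSize = 1024 * 1024)
{
    var blockIds = new List<string>();
    var tasks = new List<Task>();

    for (int offset = 0, n = 0; offset < data.Length; offset += blockSize, n++)
    {
        // Block IDs must be Base64-encoded and all the same length within a blob.
        string blockId = Convert.ToBase64String(Encoding.UTF8.GetBytes(n.ToString("d6")));
        blockIds.Add(blockId);

        // Stage each block independently so the blocks can upload in parallel.
        int count = Math.Min(blockSize, data.Length - offset);
        tasks.Add(blob.PutBlockAsync(blockId, new MemoryStream(data, offset, count), null));
    }

    await Task.WhenAll(tasks);

    // Commit the staged blocks in order; the blob only becomes visible after this call.
    await blob.PutBlockListAsync(blockIds);
}

The parallelism here is unbounded; in practice you would throttle the staging (e.g. with a SemaphoreSlim) and tune the block size and degree of parallelism to the machine and the network link.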