Uploading 10,000,000 files to Azure blob storage

Posted 2020-04-04 07:21

Question:

I have some experience with S3, and in the past have used s3-parallel-put to put many (millions of) small files there. Compared to Azure, S3 has an expensive PUT price, so I'm thinking of switching to Azure.

However, I can't seem to figure out how to sync a local directory to a remote container using the Azure CLI. In particular, I have the following questions:

1- The aws client provides a sync option. Is there such an option for Azure?

2- Can I concurrently upload multiple files to Azure Storage using the CLI? I noticed that there is a -concurrenttaskcount flag for azure storage blob upload, so I assume it must be possible in principle.
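For question 1, the behavior I mean is what the aws client's sync does: mirror a directory, uploading only new or changed files (the bucket name and paths below are hypothetical):

    # Mirror a local directory to S3; only new/changed files are uploaded
    aws s3 sync ./localdir s3://my-bucket/prefix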

Answer 1:

If you prefer the command line and have a recent Python interpreter, the Azure Batch and HPC team has released a Python code sample with some AzCopy-like functionality called blobxfer. It allows full recursive directory ingress into Azure Storage as well as full container copy back out to local storage. [full disclosure: I'm a contributor to this code]

To answer your questions:

  1. blobxfer supports rsync-like operations using MD5 checksum comparisons for both ingress and egress
  2. blobxfer performs concurrent operations, both within a single file and across multiple files. However, you may want to split your input across multiple directories and containers, which will not only reduce the script's memory usage but also partition your load better (see the sketch after this list).
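A minimal sketch of such an upload, assuming the blobxfer 1.x CLI (the account, key, container, and path are placeholders, and exact flag names may differ by version; check blobxfer --help):

    # Recursively ingress a local directory into a container.
    # --skip-on-md5-match gives the rsync-like behavior from point 1:
    # files whose MD5 already matches the remote copy are not re-uploaded.
    blobxfer upload \
        --storage-account mystorageaccount \
        --storage-account-key "$STORAGE_KEY" \
        --remote-path mycontainer \
        --local-path /data/files \
        --skip-on-md5-match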


Answer 2:

https://github.com/Azure/azure-sdk-tools-xplat is the source of azure-cli, so more details can be found there. You may want to open issues on the repo :)

  1. azure-cli doesn't support "sync" so far.
  2. -concurrenttaskcount enables parallel upload within a single file, which increases the upload speed a lot, but it doesn't support multiple files yet (see the sketch below).
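A sketch of what that looks like with the xplat CLI (account, container, and file names are hypothetical, and flag spellings may differ by installed version):

    # Parallelism *within* one blob: split the upload into 8 concurrent tasks
    azure storage blob upload ./bigfile.dat mycontainer bigfile.dat \
        --account-name mystorageaccount \
        --account-key "$STORAGE_KEY" \
        --concurrenttaskcount 8

    # Hypothetical workaround for cross-file concurrency: fan out from the shell
    ls ./files | xargs -P 8 -I{} \
        azure storage blob upload "./files/{}" mycontainer "{}"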


Answer 3:

For uploading files in bulk into blob storage, Microsoft provides a tool: check out Storage Explorer, which lets you do the required task.
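Storage Explorer is a GUI; if you want the same bulk upload from the command line, the AzCopy tool mentioned in Answer 1 can do it. A sketch assuming the AzCopy v10 syntax (account, container, and SAS token are placeholders):

    # Recursively copy a local directory into a container
    azcopy copy "/data/files" \
        "https://mystorageaccount.blob.core.windows.net/mycontainer?<SAS>" --recursive

    # v10 also has the sync semantics asked about in question 1
    azcopy sync "/data/files" \
        "https://mystorageaccount.blob.core.windows.net/mycontainer?<SAS>"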