I am trying to copy data from Cosmos DB to Data Lake Store with Data Factory.
However, the performance is poor, about 100 KB/s, and the data volume is 100+ GB and keeps increasing. At that rate it will take 10+ days to finish, which is not acceptable.
The Microsoft documentation at https://docs.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance says the maximum speed from Cosmos DB to Data Lake Store is 1 MB/s. Even that would still be too slow for us.
The Cosmos DB data migration tool doesn't work either: no data is exported and no error log is produced.
Data Lake Analytics U-SQL can extract from external sources, but currently only Azure SQL Database/Data Warehouse and SQL Server are supported, not Cosmos DB.
How, or with what tools, can I improve the copy performance?
Based on your description, I suggest you try setting a higher cloudDataMovementUnits value to improve the performance.
A cloud data movement unit (DMU) is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. A DMU might be used in a cloud-to-cloud copy operation, but not in a hybrid copy.
By default, Data Factory uses a single cloud DMU to perform a single Copy Activity run. To override this default, specify a value for the cloudDataMovementUnits property in the copy activity's typeProperties. For information about the level of performance gain you might get when you configure more units for a specific copy source and sink, see the performance reference.
Note: a setting of 8 or above currently works only when you copy multiple files from Blob storage/Data Lake Store/Amazon S3/cloud FTP/cloud SFTP to Blob storage/Data Lake Store/Azure SQL Database.
Since Cosmos DB (DocumentDB) is not one of those sources, the maximum DMU you can set here is 4.
Besides, if that speed still doesn't meet your requirement, I suggest you write your own logic to copy from DocumentDB to Data Lake.
You could create multiple WebJobs that copy from DocumentDB to Data Lake in parallel.
You could split the documents by index range or partition key and have each WebJob copy a different part; in my opinion, this will be faster. A rough sketch of the idea follows.
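To illustrate the idea only (this is a minimal sketch, not production code), the C# below splits the copy work into ranges and runs one copy task per range in parallel. The range boundaries and CopyRangeAsync are placeholders you would replace with your own DocumentDB query and Data Lake Store write logic.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ParallelCopySketch
{
    static async Task Main()
    {
        // Hypothetical ranges (for example, partition key ranges); derive these from your own data.
        var ranges = new List<(string From, string To)>
        {
            ("A", "F"), ("G", "M"), ("N", "S"), ("T", "Z")
        };

        // One copy task per range, running concurrently. Each range could also be
        // handled by a separate WebJob instance instead of a task in one process.
        await Task.WhenAll(ranges.Select(r => CopyRangeAsync(r.From, r.To)));
    }

    static Task CopyRangeAsync(string from, string to)
    {
        // Placeholder: query DocumentDB for the documents in this range
        // and append them to a file in Data Lake Store.
        Console.WriteLine($"Copying range {from}..{to}");
        return Task.CompletedTask;
    }
}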
About the DMU: can I use it directly, or do I need to apply for it first? Are the WebJobs you mention the same as a .NET custom activity? Can you give some more details?
As far as I know, you can use the DMU directly; just add the cloudDataMovementUnits value in the pipeline JSON as below:
"activities":[
{
"name": "Sample copy activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "InputDataset" }],
"outputs": [{ "name": "OutputDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"cloudDataMovementUnits": 32
}
}
]
A WebJob can run programs or scripts in your Azure App Service web app in three ways: on demand, continuously, or on a schedule.
That means you could write a C# program (or use another language) to copy the data from DocumentDB to Data Lake; all of the copy logic has to be written by yourself. A rough sketch of such a program is below.
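As a starting point only (a minimal sketch, assuming the DocumentDB .NET SDK, i.e. the Microsoft.Azure.DocumentDB NuGet package), this reads documents page by page and writes them out as JSON lines. The endpoint, key, database, and collection names are placeholders, and the upload to Data Lake Store is only indicated in a comment, not implemented.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;
using Newtonsoft.Json;

class CopyJob
{
    static async Task Main()
    {
        // Placeholder connection settings - replace with your own account values.
        var client = new DocumentClient(new Uri("https://<account>.documents.azure.com:443/"), "<auth-key>");
        var collectionUri = UriFactory.CreateDocumentCollectionUri("<database>", "<collection>");

        // Each WebJob instance could run a different WHERE clause (for example a
        // partition key range) so that several jobs copy different slices in parallel.
        var query = client.CreateDocumentQuery<dynamic>(
                collectionUri,
                "SELECT * FROM c",
                new FeedOptions { MaxItemCount = 1000, EnableCrossPartitionQuery = true })
            .AsDocumentQuery();

        // This sketch writes the documents to a local file as JSON lines; a real job
        // would then upload the file (or stream it) to Data Lake Store, for example
        // with the Data Lake Store SDK or the AdlCopy tool.
        using (var writer = new StreamWriter("export-part-01.json"))
        {
            while (query.HasMoreResults)
            {
                foreach (var doc in await query.ExecuteNextAsync<dynamic>())
                {
                    writer.WriteLine(JsonConvert.SerializeObject(doc));
                }
            }
        }
    }
}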