Q:1
We are thinking of parallelizing reads/writes to ADLA tables and were wondering what the implications of such a design are.
I think reads are fine, but what is the best practice for concurrent writes to the same ADLA table?
Q:2
Suppose we have U-SQL scripts that contain multiple rowsets and multiple OUTPUT/INSERT statements into the same or different ADLA tables. What is the transaction scope story in U-SQL? If any OUTPUT/INSERT statement fails, will it cause all previous inserts to roll back or not? How should we handle transaction scope?
Thanks
Amit
Before I answer, let me describe what happens when you insert into a table (I assume that is what you mean by writes to a table, and not truncate/insert).
Each INSERT statement will create a new extent file for the table. Thus, if you insert new rows (the recommendation is to insert many rows at a time and not just one row), a new file gets created, and the metadata is updated during the finalization phase so the metadata service knows that the file belongs to the table.
So you should be able to run several inserts in parallel.
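For illustration, here is a minimal U-SQL sketch of such an INSERT job (the table dbo.Events, its schema, and the input path are made-up names for this example; the table is assumed to already exist). Submitting this kind of job several times in parallel with different input files should work, since each job appends its own extent file to the table:

    // Assumed: dbo.Events already exists with a matching schema
    // (created via CREATE TABLE with a clustered index and distribution).
    @newRows =
        EXTRACT EventId int,
                EventDate DateTime,
                Payload string
        FROM "/input/events_batch1.csv"
        USING Extractors.Csv();

    // One INSERT appends many rows at once and creates one new extent file;
    // a second job reading a different batch can run concurrently against dbo.Events.
    INSERT INTO dbo.Events
    SELECT EventId, EventDate, Payload
    FROM @newRows;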
The transactional scope is currently as follows (note that Azure Data Lake Analytics is a big data processing platform, not an OLTP platform, and thus does not provide different transactional guarantees to choose from):
The batch processing of U-SQL in ADLA is done in 4 phases:
- Preparation, which contains compilation, optimization, and code generation
- Queuing, where the job waits for all the needed resources
- Execution, the actual runtime phase
- Finalization, where files and metadata get persisted.
During the execution phase, either all vertices succeed or the job fails if a runtime error occurs. So it is all or nothing.
Once the processing enters the finalization phase, atomicity is reduced to the file or table level. You may generate 3 files, but finalizing one file may fail for some reason; the job then fails, yet the 2 files that were finalized successfully will still be created.
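As a hypothetical illustration of that finalization behavior (all paths and names below are made up), consider a script that writes three files. If finalizing one of the three outputs fails, the job is reported as failed, but the files whose finalization succeeded will still exist and may need to be cleaned up before a re-run:

    @logs =
        EXTRACT Region string, Amount double
        FROM "/input/logs.csv"
        USING Extractors.Csv();

    @us = SELECT * FROM @logs WHERE Region == "US";
    @eu = SELECT * FROM @logs WHERE Region == "EU";

    // Three independent output files: once the job reaches the finalization
    // phase, atomicity applies per file, not per job.
    OUTPUT @logs TO "/output/all.csv" USING Outputters.Csv();
    OUTPUT @us TO "/output/us.csv" USING Outputters.Csv();
    OUTPUT @eu TO "/output/eu.csv" USING Outputters.Csv();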