Retrieving many rows using a TableBatchOperation is not supported?

Posted 2019-02-16 12:12

Question:

Here is a piece of code that initializes a TableBatchOperation designed to retrieve two rows in a single batch:

 TableBatchOperation batch = new TableBatchOperation();
 batch.Add(TableOperation.Retrieve("somePartition", "rowKey1"));
 batch.Add(TableOperation.Retrieve("somePartition", "rowKey2")); 
 //second call throws an ArgumentException:
 //"A batch transaction with a retrieve operation cannot contain 
 //any other operation"

As mentioned, an exception is thrown, and it seems retrieving N rows in a single batch is not supported. This is a big deal to me, as I need to retrieve about 50 rows per request. This issue is as much about performance as about cost. As you may know, Azure Table Storage pricing is based on the number of transactions, which means that 50 retrieve operations are 50 times more expensive than a single batch operation.

Have I missed something?

Side note: I'm using the new Azure Storage API 2.0. I've noticed this question has never been raised on the web. This constraint might have been added recently?

Edit:

I found a related question here: Very Slow on Azure Table Storage Query on PartitionKey/RowKey List. It seems that using a TableQuery with "or" conditions on row keys results in a full table scan. There's really a serious issue here...

Answer 1:

When designing your Partition Key (PK) and Row Key (RK) scheme in Azure Table Storage (ATS), your primary consideration should be how you're going to retrieve the data. As you've said, each query you run costs both money and, more importantly, time, so you need to get all of the data back in one efficient query. The efficient queries that you can run on ATS are of these types (a sketch of the PK-range shape follows the list):

  • Exact PK and RK
  • Exact PK, RK range
  • PK Range
  • PK Range, RK range
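
As an illustration of the third shape (PK range), here is a minimal sketch using the SDK's filter helpers; the key bounds are hypothetical, not from this answer:

string pkRangeFilter = TableQuery.CombineFilters(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.GreaterThanOrEqual, "pkStart"),
    TableOperators.And,
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThan, "pkEnd"));
var pkRangeQuery = new TableQuery<DynamicTableEntity>().Where(pkRangeFilter);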

Based on your comments I'm guessing you've got some data that is similar to this:

PK    RK     Data
Guid1 A      {Data:{...}, RelatedRows: [{PK:"Guid2", RK:"B"}, {PK:"Guid3", RK:"C"}]}
Guid2 B      {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}
Guid3 C      {Data:{...}, RelatedRows: [{PK:"Guid1", RK:"A"}]}

and you've retrieved the data at Guid1, and now you need to load Guid2 and Guid3. I'm also presuming that these rows have no common denominator, such as all belonging to the same user. With this in mind, I'd create an extra "index table", which could look like this:

PK      RK      Data
Guid1-A Guid2-B {Data:{....}}
Guid1-A Guid3-C {Data:{....}}
Guid2-B Guid1-A {Data:{....}}
Guid3-C Guid1-A {Data:{....}}

Where the PK is the combined PK and RK of the parent, and the RK is the combined PK and RK of the child row. You can then run a query asking for all rows with PK="Guid1-A", and you will get all the related data with just one call (or two calls overall). The biggest overhead this creates is in your writes: now, when you write a row, you also have to write a row for each of the related rows, and you have to make sure the data is kept up to date (this may not be an issue for you if this is a write-once kind of scenario).
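
Here's a minimal sketch of the read side of that index table (the indexTable reference is an assumption, not code from this answer):

// Fetch every index row whose PK is the parent's combined "PK-RK" key.
// indexTable is assumed to be a CloudTable from the WindowsAzure.Storage SDK.
var relatedQuery = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Guid1-A"));

foreach (DynamicTableEntity related in indexTable.ExecuteQuery(relatedQuery))
{
    // related.RowKey holds the child's combined "PK-RK" key, e.g. "Guid2-B".
}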

If any of my assumptions are wrong or if you have some example data I can update this answer with more relevant examples.



Answer 2:

Try something like this:

TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>()
                                                .Where(TableQuery.CombineFilters(
                                                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "partition1"),
                                                    TableOperators.And,
                                                    TableQuery.CombineFilters(
                                                        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row1"),
                                                        TableOperators.Or,
                                                        TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "row2"))));


Answer 3:

I know that this is an old question, but as Azure STILL does not support secondary indexes, it seems it will be relevant for some time.

I was hitting the same type of problem. In my scenario, I needed to look up hundreds of items within the same partition, where there are millions of rows (imagine a GUID as the row key). I tested a couple of options for looking up 10,000 rows:

  1. (PK && RK)
  2. (PK && RK1) || (PK & RK2) || ...
  3. PK && (RK1 || RK2 || ... )

I was using the async API with a maximum of 10 degrees of parallelism (at most 10 outstanding requests). I also tested a couple of different batch sizes (10 rows, 50, 100).

Test                        Batch Size  API calls   Elapsed (sec)
(PK && RK)                  1           10000       95.76
(PK && RK1) || (PK && RK2)  10          1000        25.94
(PK && RK1) || (PK && RK2)  50          200         18.35
(PK && RK1) || (PK && RK2)  100         100         17.38
PK && (RK1 || RK2 || … )    10          1000        24.55
PK && (RK1 || RK2 || … )    50          200         14.90
PK && (RK1 || RK2 || … )    100         100         13.43

NB: These are all within the same partition - just multiple rowkeys.

I would have been happy to just reduce the number of API calls. But as an added benefit, the elapsed time is also significantly less, saving on compute costs (at least on my end!).

Not too surprisingly, the batches of 100 rows delivered the best elapsed performance. There are obviously other performance considerations, especially network usage (for example, #1 hardly uses the network at all, whereas the others push it much harder).

EDIT: Be careful when querying for many row keys. There is (of course) a URL length limitation on the query. If you exceed the length, the query will still succeed because the service cannot tell that the URL was truncated. In our case, we limited the combined query length to about 2500 characters (URL encoded!).
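
As an illustration of that cap, here is a minimal, untested sketch (not the answerer's code) that chunks a list of row keys into PK && (RK1 || RK2 || ...) filter strings, starting a new chunk once the raw (pre-URL-encoding) filter length approaches a budget:

// Builds one combined filter string per chunk. The name BuildChunkedFilters is
// hypothetical, and 2500 characters approximates the answerer's limit.
static IEnumerable<string> BuildChunkedFilters(string pk, IEnumerable<string> rowKeys, int maxLength = 2500)
{
    string pkFilter = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, pk);
    string current = null;
    foreach (string rk in rowKeys)
    {
        string rkFilter = TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, rk);
        string candidate = current == null
            ? rkFilter
            : TableQuery.CombineFilters(current, TableOperators.Or, rkFilter);
        if (current != null && candidate.Length > maxLength)
        {
            // Close the current chunk and start a new one with this row key.
            yield return TableQuery.CombineFilters(pkFilter, TableOperators.And, current);
            current = rkFilter;
        }
        else
        {
            current = candidate;
        }
    }
    if (current != null)
        yield return TableQuery.CombineFilters(pkFilter, TableOperators.And, current);
}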



Answer 4:

Batch "Get" operations are not supported by Azure Table Storage. Supported operations are: Add, Delete, Update, and Merge. You would need to execute queries as separate requests. For faster processing, you may want to execute these queries in parallel.



Answer 5:

Your best bet is to create a LINQ/OData select query... that will fetch what you're looking for.

For better performance you should make one query per partition and run those queries simultaneously.

I haven't tested this personally, but think it would work.
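
For what it's worth, a minimal, untested sketch of such a query, assuming the LINQ provider that ships with later 2.x versions of the storage SDK and a hypothetical MyEntity : TableEntity class:

var rows = (from e in table.CreateQuery<MyEntity>()
            where e.PartitionKey == "partition1"
               && (e.RowKey == "row1" || e.RowKey == "row2")
            select e).ToList();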



Answer 6:

How many entities do you have per partition? With a single query you can pull back up to 1,000 records. Then you could do your row key filtering on the in-memory set and only pay for one operation.
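
A minimal sketch of that approach (the names are assumptions, not code from this answer):

// Pull the partition and filter row keys client-side. The service returns at
// most 1,000 entities per request; ExecuteQuery pages automatically, so a
// partition larger than that costs extra transactions.
var wanted = new HashSet<string> { "rowKey1", "rowKey2" };
var partitionQuery = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "somePartition"));

var hits = table.ExecuteQuery(partitionQuery)
                .Where(e => wanted.Contains(e.RowKey))
                .ToList();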

Another option is to do a row key range query to retrieve part of a partition in one operation. Essentially, you specify an upper and a lower bound for the row keys to return, rather than requesting an entire partition.
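
Something like this, as a sketch with hypothetical bounds:

var rangeQuery = new TableQuery<DynamicTableEntity>().Where(
    TableQuery.CombineFilters(
        TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "somePartition"),
        TableOperators.And,
        TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.GreaterThanOrEqual, "rowKey1"),
            TableOperators.And,
            TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.LessThanOrEqual, "rowKey2"))));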



Answer 7:

Okay, so the best-case replacement for a batch retrieve operation is a table query. A less optimal situation would require parallel retrieve operations.

Depending on your PK and RK design, you can figure out, based on a list of (PK, RK) pairs, the smallest/most efficient set of retrieve/query operations that you need to perform. You then fetch all these things in parallel and sort out the exact answer client side.

IMAO, it was a design miss by Microsoft to add the Retrieve method to the TableBatchOperation class because it conveys semantics not supported by the table storage API.

Right now, I'm not in the mood to write something super efficient, so I'm just gonna leave this super simple solution here.

var retrieveTasks = new List<Task<TableResult>>();

// Start every point retrieve without awaiting, so the requests run in parallel.
foreach (var item in list)
{
    retrieveTasks.Add(table.ExecuteAsync(TableOperation.Retrieve(item.pk, item.rk)));
}

var retrieveResults = new List<TableResult>();

// Await the tasks in the order they were issued, preserving the input order.
foreach (var retrieveTask in retrieveTasks)
{
    retrieveResults.Add(await retrieveTask);
}

This asynchronous block of code will fetch the entities in list in parallel and store the results in retrieveResults, preserving the order. If you have contiguous ranges of entities that you need to fetch, you can improve on this by using a range query.
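
If you want to cap the number of outstanding requests, as the benchmark in Answer 3 did with its limit of 10, a SemaphoreSlim-throttled variant works; a minimal sketch, not part of the original answer (requires System.Linq, System.Threading, and System.Threading.Tasks):

var throttle = new SemaphoreSlim(10); // at most 10 requests in flight

var throttledTasks = list.Select(async item =>
{
    await throttle.WaitAsync();
    try
    {
        return await table.ExecuteAsync(TableOperation.Retrieve(item.pk, item.rk));
    }
    finally
    {
        throttle.Release();
    }
}).ToList();

TableResult[] throttledResults = await Task.WhenAll(throttledTasks);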

There is a sweet spot (that you'll have to find by testing) where it's probably faster/cheaper to query more entities than a specific batch retrieve needs and then discard the results you don't need.

If you have a small partition you might benefit from a query like so:

where pk=partition1 and (rk=rk1 or rk=rk2 or rk=rk3)

If the lexicographic (i.e. sort order) distance between your keys is great, you might want to fetch them in parallel. For example, if you store the alphabet in table storage, fetching a and z, which are far apart, is best done with parallel retrieve operations, while fetching a, b and c, which are close together, is best done with a query. Fetching a, b, c, and z would benefit from a hybrid approach.

If you know all this up front, you can compute the best thing to do given a set of PKs and RKs. The more you know about how the underlying data is sorted, the better your results will be. I'd advise against taking a general approach here; instead, try to apply what you learn from these different query patterns to solve your problem.