Querying DynamoDB by date

2019-01-08 06:24发布

站内文章 / NoSql

44 0

爷、活的狠高调

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm coming from a relational database background and trying to work with amazon's DynamoDB

I have a table with a hash key "DataID" and a range "CreatedAt" and a bunch of items in it.

I'm trying to get all the items that were created after a specific date and sorted by date. Which is pretty straightforward in a relational database.

In DynamoDB the closest thing i could find is a query and using the range key greater than filter. The only issue is that to perform a query i need a hash key which defeats the purpose.

So what am I doing wrong? Is my table schema wrong, shouldn't the hash key be unique? or is there another way to query?

回答1:

Updated Answer:

DynamoDB allows for specification of secondary indexes to aid in this sort of query. Secondary indexes can either be global, meaning that the index spans the whole table across hash keys, or local meaning that the index would exist within each hash key partition, thus requiring the hash key to also be specified when making the query.

For the use case in this question, you would want to use a global secondary index on the "CreatedAt" field.

For more on DynamoDB secondary indexes see the secondary index documentation

Original Answer:

DynamoDB does not allow indexed lookups on the range key only. The hash key is required such that the service knows which partition to look in to find the data.

You can of course perform a scan operation to filter by the date value, however this would require a full table scan, so it is not ideal.

If you need to perform an indexed lookup of records by time across multiple primary keys, DynamoDB might not be the ideal service for you to use, or you might need to utilize a separate table (either in DynamoDB or a relational store) to store item metadata that you can perform an indexed lookup against.

回答2:

Given your current table structure this is not currently possible in DynamoDB. The huge challenge is to understand that the Hash key of the table (partition) should be treated as creating separate tables. In some ways this is really powerful (think of partition keys as creating a new table for each user or customer, etc...).

Queries can only be done in a single partition. That's really the end of the story. This means if you want to query by date (you'll want to use msec since epoch), then all the items you want to retrieve in a single query must have the same Hash (partition key).

I should qualify this. You absolutely can scan by the criterion you are looking for, that's no problem, but that means you will be looking at every single row in your table, and then checking if that row has a date that matches your parameters. This is really expensive, especially if you are in the business of storing events by date in the first place (i.e. you have a lot of rows.)

You may be tempted to put all the data in a single partition to solve the problem, and you absolutely can, however your throughput will be painfully low, given that each partition only receives a fraction of the total set amount.

The best thing to do is determine more useful partitions to create to save the data:

Do you really need to look at all the rows, or is it only the rows by a specific user?
Would it be okay to first narrow down the list by Month, and do multiple queries (one for each month)? Or by Year?
If you are doing time series analysis there are a couple of options, change the partition key to something computated on PUT to make the query easier, or use another aws product like kinesis which lends itself to append-only logging.

回答3:

Approach I followed to solve this problem is by created a Global Secondary Index as below. Not sure if this is the best approach but hopefully if it is useful to someone.

Hash Key                 | Range Key
------------------------------------
Date value of CreatedAt  | CreatedAt

Limitation imposed on the HTTP API user to specify the number of days to retrieve data, defaulted to 24 hr.

This way, I can always specify the HashKey as Current date's day and RangeKey can use > and < operators while retrieving. This way the data is also spread across multiple shards.

回答4:

Your Hash key (primary of sort) has to be unique (unless you have a range like stated by others).

In your case, to query your table you should have a secondary index.

|  ID  | DataID | Created | Data |
|------+--------+---------+------|
| hash | xxxxx  | 1234567 | blah |

Your Hash Key is ID Your secondary index is defined as: DataID-Created-index (that's the name that DynamoDB will use)

Then, you can make a query like this:

var params = {
    TableName: "Table",
    IndexName: "DataID-Created-index",
    KeyConditionExpression: "DataID = :v_ID AND Created > :v_created",
    ExpressionAttributeValues: {":v_ID": {S: "some_id"},
                                ":v_created": {N: "timestamp"}
    },
    ProjectionExpression: "ID, DataID, Created, Data"
};

ddb.query(params, function(err, data) {
    if (err) 
        console.log(err);
    else {
        data.Items.sort(function(a, b) {
            return parseFloat(a.Created.N) - parseFloat(b.Created.N);
        });
        // More code here
    }
});

Essentially your query looks like:

SELECT * FROM TABLE WHERE DataID = "some_id" AND Created > timestamp;

The secondary Index will increase the read/write capacity units required so you need to consider that. It still is a lot better than doing a scan, which will be costly in reads and in time (and is limited to 100 items I believe).

This may not be the best way of doing it but for someone used to RD (I'm also used to SQL) it's the fastest way to get productive. Since there is no constraints in regards to schema, you can whip up something that works and once you have the bandwidth to work on the most efficient way, you can change things around.

回答5:

You could make the Hash key something along the lines of a 'product category' id, then the range key as a combination of a timestamp with a unique id appended on the end. That way you know the hash key and can still query the date with greater than.

回答6:

You can have multiple identical hash keys; but only if you have a range key that varies. Think of it like file formats; you can have 2 files with the same name in the same folder as long as their format is different. If their format is the same, their name must be different. The same concept applies to DynamoDB's hash/range keys; just think of the hash as the name and the range as the format.

Also, I don't recall if they had these at the time of the OP (I don't believe they did), but they now offer Local Secondary Indexes.

My understanding of these is that it should now allow you to perform the desired queries without having to do a full scan. The downside is that these indexes have to be specified at table creation, and also (I believe) cannot be blank when creating an item. In addition, they require additional throughput (though typically not as much as a scan) and storage, so it's not a perfect solution, but a viable alternative, for some.

I do still recommend Mike Brant's answer as the preferred method of using DynamoDB, though; and use that method myself. In my case, I just have a central table with only a hash key as my ID, then secondary tables that have a hash and range that can be queried, then the item points the code to the central table's "item of interest", directly.

Additional data regarding the secondary indexes can be found in Amazon's DynamoDB documentation here for those interested.

Anyway, hopefully this will help anyone else that happens upon this thread.

回答7:

Updated Answer There is no convenient way to do this using Dynamo DB Queries with predictable throughput. One (sub optimal) option is to use a GSI with an artificial HashKey & CreatedAt. Then query by HashKey alone and mention ScanIndexForward to order the results. If you can come up with a natural HashKey (say the category of the item etc) then this method is a winner. On the other hand, if you keep the same HashKey for all items, then it will affect the throughput mostly when when your data set grows beyond 10GB (one partition)

Original Answer: You can do this now in DynamoDB by using GSI. Make the "CreatedAt" field as a GSI and issue queries like (GT some_date). Store the date as a number (msecs since epoch) for this kind of queries.

Details are available here: Global Secondary Indexes - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html#GSI.Using

This is a very powerful feature. Be aware that the query is limited to (EQ | LE | LT | GE | GT | BEGINS_WITH | BETWEEN) Condition - Amazon DynamoDB : http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Condition.html