Order Results by Priority & Group Values & then fi

2019-07-10 05:52发布

问题:

Here is my DynamoDB current data:

My goal is to create a query which filters the result in the Group Set (to like "Default") , and then sorts by priority and then filters the results to those where loggedIn == true and status == idle.

In SQL it would be something like

SELECT * 
FROM userstatustable 
WHERE group == "default" 
  AND loggedIn == true AND status == "idle" 
ORDER BY priority DESC 
LIMIT 1

How would I create a query to do this?

Below is my serverless.yml file description of the DynamoDB table.

userStatusTable: #This table is used to track a users current status.
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: ${self:custom.userStatusTable}
        AttributeDefinitions: #UserID in this case will be created once and constantly updated as it changes with status regarding the user.
          - AttributeName: userId
            AttributeType: S
        KeySchema:
          - AttributeName: userId
            KeyType: HASH
        ProvisionedThroughput:
            ReadCapacityUnits: ${self:custom.dynamoDbCapacityUnits.${self:custom.pstage}}
            WriteCapacityUnits: ${self:custom.dynamoDbCapacityUnits.${self:custom.pstage}}

Things I have tried:

Below is my current code so far:

 const userStatusParams = {
        TableName: process.env.USERSTATUS_TABLE,
        FilterExpression: "loggedIn = :loggedIn and #s = :status and contains(#g,:group) ",
        //Limit: 1,
        ExpressionAttributeValues: {
          ":loggedIn": true,
          ":status" : "idle",
          ":group" : "DEFAULT"
        },
        ExpressionAttributeNames: {"#s": "status","#g" : "group"}
      };
      var usersResult;
      try {
        usersResult = await dynamoDbLib.call("scan", userStatusParams);
        console.log(usersResult);
      }catch (e) {
        console.log("Error occurred querying for users belong to group.");
        console.log(e);
      }

This uses scan and is able to return all the results that meet the criteria ... however it does not sort the results by priority in descending order.

Note: status and group are reserved keywords apparently so I had to use ExpressionAttributeNames to account for that. Also note that this table will have thousands of users in it eventually.

回答1:

Indexes are not about sorting. Sorting is but one method used to efficiently retrieve rows, because search in sorted array can be done in logarithmic time O(log n), instead of linear time O(n). It's only a consequence that rows are returned in sorted order. But let's focus on the capability of finding the exact rows to be returned with less effort (I/O e.g. disk reads).

The filtering needs for this type of query (by group, status and a couple more columns) are really hard for DynamoDB fo process efficiently. By efficiency I mean how many rows DynamoDB needs to retrieve from disk to determine which rows to return to the client. If it returns 10% of total rows read from the table, that's not efficient. This is why a regular Scan together with filters isn't as good as an indexed query. Filters are a lie, as they still read from items from database and count towards provisioned capacity. An indexed query will retrieve from storage rows close to the number it actually returns. This is achieved with DynamoDB but limiting to a single partition (items with the same partition/hash key), and a range (starting with, >=, <=) for a sort key.

Why rows are not returned sorted by sort key when doing a scan? Because DynamoDB uses the sort key inside Item Collections, each collection being determined by the hash key. When the result set contains, for instance, 2 unique hash keys, the result set will contain 2 separate sections sorted by the sort key, in other words, rows will not be sorted in a single direction, they'll restart in the middle of the result set. Sorting in memory will be required to have a single sorted collection.

Why not create then an index on a column that can have a single value for all rows? If we run a scan, then rows will be returned sorted by priority (sort key). But having all items containing the same value for a field, this is a data anomaly.

So when should I create an index?

  • I want to query on a field that could be empty for some/many rows, so the index would contain fewer items than the whole table;
  • My query would apply a range operator that will be able to select a smaller portion of data;

Given the group attribute should likely be the most selective, it would be faster to hash on this attribute on a global index, but this would change the model, requiring to store each group in a separate item, instead of using a string set. That's not very convenient in the NoSQL world, requiring more of a relational model.


So, one way to do it, considering it's okay to use a scan, but without a separate index, is running a scan and sorting in memory. Use the Array#sort() method to do it in node.js. The performance characteristics are closer to the approach with secondary index, only a index would be just a waste of resources in this case. Because if a query/scan on an index returns the same amout of information a scan on a table would, go with the table approach. Remember: indexes are about being selective when retrieving rows.

How would I know if this is a good approach for my use case? Well, this isn't a definitive rule, but I'd say if you want to retrieve more than 50% of the table rows, this would be okay. In matter of costs it wouldn't pay off keeping a separate index. Even looking into another design perhaps, because this isn't very selective. Now if you want like 20% or less data, then a different approach would be nice to look into.



回答2:

Link to my other answer that uses the main table as designed.

This approach requires modifying group modeling in UserStatus from a single record with a string set to multiple records with a string. This is because DynamoDB doesn't support (yet, that makes for a good feature request though) keying on sets.

The main table is used for updates/inserts/deletes and looks like this:

+--------+---------+-------+----------+----------+--------+
| userId | group   | type  | priority | loggedIn | status |
+--------+---------+-------+----------+----------+--------+
| 123    | default | admin | 1        | true     | idle   |
+--------+---------+-------+----------+----------+--------+
| 123    | orange  | admin | 1        | true     | idle   |
+--------+---------+-------+----------+----------+--------+
| 124    | default | admin | 3        | false    | idle   |
+--------+---------+-------+----------+----------+--------+
| 125    | orange  | admin | 2        | false    | idle   |
+--------+---------+-------+----------+----------+--------+
  • Partition/hash key: userId
  • Sort key: group

Setup a GSI on (group, priority). This will be used for queries. Yes, the combination chosed for this index will have duplicates: DynamoDB doesn't bother with this and works nicely.

+---------+----------+--------+-------+----------+--------+
| group   | priority | userId | type  | loggedIn | status |
+---------+----------+--------+-------+----------+--------+
| default | 1        | 123    | admin | true     | idle   |
+---------+----------+--------+-------+----------+--------+
| default | 3        | 124    | admin | false    | idle   |
+---------+----------+--------+-------+----------+--------+
| orange  | 1        | 123    | admin | true     | idle   |
+---------+----------+--------+-------+----------+--------+
| orange  | 2        | 125    | admin | false    | idle   |
+---------+----------+--------+-------+----------+--------+

Tasks:

  • Updating an user on this tables requires update/insert as many rows as there are groups the user belongs too;
  • Removing an user means removing all items for the user.
  • Querying is done by group = :group and priority >= :priority, filtering on status = 'idle' and loggedIn = true
    • a variant is to sort on status or loggedIn, because you filter with them, this help making the query a bit more selective, then sort on priority on the client

Should I follow this approach? I think it is a good design when there are many groups and a single group contains up to 20% of total users, and users belong to 2 or 2 groups.



回答3:

So I found an interesting solution to this problem.

Here is my new code.

const userStatusParams = {
        TableName: process.env.USERSTATUS_TABLE,
        IndexName:"typePriorityIndex",
        FilterExpression: "loggedIn = :loggedIn and #s = :status and contains(#g,:group) ",
        KeyConditionExpression: "#t = :type and priority >= :priority",
        Limit: 1,
        ExpressionAttributeValues: {
          ":loggedIn": true,
          ":status" : "idle",
          ":group" : "DEFAULT",
          ":priority" : 0,
          ":type" : "admin"
        },
        ExpressionAttributeNames: {"#s": "status","#g" : "group", "#t" : "type"}
      };
      var usersResult;
      try {
        usersResult = await dynamoDbLib.call("query", userStatusParams);
        console.log(usersResult);
      }catch (e) {
        console.log("Error occurred querying for users belong to group.");
        console.log(e);
      }

Note the use of IndexName: "typePriorityIndex", the trick here is to find something or make something in your table that the records will all have the same and make that the hash key, then the sort key should be what you will want to sort by, which in my case is priority.

The index looks like this to give an idea.

My serverless file looks like this for defining it

userStatusTable: #This table is used to track a users current status.
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: ${self:custom.userStatusTable}
        AttributeDefinitions: #UserID in this case will be created once and constantly updated as it changes with status regarding the user.
          - AttributeName: userId
            AttributeType: S
          - AttributeName: priority
            AttributeType: N
          - AttributeName: type
            AttributeType: S
        KeySchema:
          - AttributeName: userId
            KeyType: HASH
        ProvisionedThroughput:
            ReadCapacityUnits: ${self:custom.dynamoDbCapacityUnits.${self:custom.pstage}}
            WriteCapacityUnits: ${self:custom.dynamoDbCapacityUnits.${self:custom.pstage}}
        GlobalSecondaryIndexes:
          - IndexName: typePriorityIndex
            KeySchema:
              - AttributeName: type
                KeyType: HASH
              - AttributeName: priority
                KeyType: RANGE
            Projection:
              ProjectionType: ALL
            ProvisionedThroughput:
              ReadCapacityUnits: ${self:custom.dynamoDbCapacityUnits.${self:custom.pstage}}
              WriteCapacityUnits: ${self:custom.dynamoDbCapacityUnits.${self:custom.pstage}}