How to design this NoSQL DB

Posted 2019-08-29 01:40

Question:

I am trying to create a simple application for myself that uses DynamoDB. I have never used NoSQL at an advanced level, only to store a value here and there.

The application is a logger. I will log something and Dynamo will log the date and count for the day.

For example, if a user logs multiple things today, the record will just hold today's date and logged_times: 5.

I can then have a query to grab a total sum of all the logged_times within the past week / day / month etc.

My question is: how do you structure a NoSQL database to do something like this efficiently?

Answer 1:

A few NoSQL database concepts:

  1. Writes should be spread evenly across partition keys.
  2. Reads should be spread evenly across partition keys.

The obvious schema that comes to mind for this problem in DynamoDB is:

Use the key logs as the partition key (pk) and the timestamp as the sort key (sk). To do an aggregation, use

select * where pk=logs and sk is_between x and y

But this violates both concepts: we are always writing to a single pk and always reading from that same pk.
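To make the problem concrete, here is a minimal sketch of that single-key layout. It uses an in-memory dictionary as a stand-in for the table rather than a real DynamoDB client. The names `table`, `log_event`, and `total_between` are hypothetical, and the ISO date strings used as sort keys are an assumption made so that string comparison matches date order.

```python
from collections import defaultdict

# In-memory stand-in for a table: partition key -> {sort key: logged_times}.
# Illustration only; this is not a DynamoDB client.
table = defaultdict(dict)

def log_event(day):
    """Increment the counter for `day` (ISO date string) under the single key 'logs'."""
    items = table["logs"]
    items[day] = items.get(day, 0) + 1

def total_between(start_day, end_day):
    """Sum logged_times for sort keys in [start_day, end_day], inclusive."""
    return sum(n for day, n in table["logs"].items() if start_day <= day <= end_day)

for day in ["2018-01-05", "2018-01-07", "2018-01-07", "2018-01-09"]:
    log_event(day)

weekly_total = total_between("2018-01-01", "2018-01-07")
partitions_touched = len(table)  # every read and write used the same partition key
```

The range query is convenient, but `partitions_touched` stays at one no matter how much traffic arrives, which is the hot-key situation described above.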

Now, for this particular problem, our pk should be random enough (so that there are no hot keys) and deterministic enough (so that we can still query by it).

We have to make some assumptions about the application while designing keys. Let's say we decide to update every hour. We can then have 7-jan-2018-17 as a key, where 17 means the 17th hour. This key is deterministic, but it is not random enough: every update or read for 7 Jan will mostly go to the same partition.

To make the key random, we can calculate a hash of it with a hashing algorithm such as MD5. Let's say that after hashing, our key becomes 1sdc23sjdnsd. This will not make any sense if you are looking at the table data. But if you want to know the event count for 7-jan-2018-17, you just hash the time and do a get from DynamoDB with the hashed key. If you want to know all the events on 7-jan-2018, you can do 24 gets, one per hour, and aggregate the counts.
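The hashed hourly key can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the table is a plain dictionary instead of DynamoDB, the helper names (`hour_key`, `log_event`, `count_for_hour`, `count_for_day`) are hypothetical, and the bucket string uses a numeric `YYYY-MM-DD-HH` format instead of `7-jan-2018-17` to avoid locale-dependent month names.

```python
import hashlib
from datetime import datetime, timedelta

# In-memory stand-in for the table: hashed key -> event count.
table = {}

def hour_key(ts):
    """Deterministic key for the hour containing `ts`, hashed with MD5."""
    bucket = ts.strftime("%Y-%m-%d-%H")  # e.g. "2018-01-07-17"
    return hashlib.md5(bucket.encode("utf-8")).hexdigest()

def log_event(ts):
    """Increment the counter stored under the hashed hourly key."""
    key = hour_key(ts)
    table[key] = table.get(key, 0) + 1

def count_for_hour(ts):
    """One point lookup: hash the hour and read its counter."""
    return table.get(hour_key(ts), 0)

def count_for_day(day):
    """Aggregate a day with 24 point lookups, one per hourly key."""
    start = datetime(day.year, day.month, day.day)
    return sum(count_for_hour(start + timedelta(hours=h)) for h in range(24))

log_event(datetime(2018, 1, 7, 17, 5))
log_event(datetime(2018, 1, 7, 17, 40))
log_event(datetime(2018, 1, 7, 9, 0))
log_event(datetime(2018, 1, 8, 1, 0))

hour_total = count_for_hour(datetime(2018, 1, 7, 17, 0))
day_total = count_for_day(datetime(2018, 1, 7))
```

Because the key is derived only from the timestamp, any reader can recompute it without scanning the table, which is what makes the opaque hashed value usable.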

However, this kind of schema runs into problems:

  1. If you decide to change the granularity from hourly to per-minute.

  2. If most of your queries are decided at run time, such as "get me all the data for the last 2, 4, or 6 days". That means too many round trips to the database, which is inefficient in both time and cost.
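A quick back-of-the-envelope calculation shows how the number of point lookups grows with the query window and the bucket granularity. The helper `lookups_needed` is hypothetical, and the figures assume one get per bucket with no batching.

```python
def lookups_needed(days, buckets_per_day):
    """Number of point reads to cover `days` full days at a given granularity."""
    return days * buckets_per_day

HOURLY = 24           # one key per hour
PER_MINUTE = 24 * 60  # one key per minute

hourly_reads = {d: lookups_needed(d, HOURLY) for d in (2, 4, 6)}
minute_reads = {d: lookups_needed(d, PER_MINUTE) for d in (2, 4, 6)}
```

With hourly keys a 6-day window already needs 144 gets; at per-minute granularity the same window needs 8640.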

The rule of thumb is: when query patterns are well defined, use NoSQL and store the precomputed results for performance. If you are trying to run join or aggregation queries on NoSQL, you are force-fitting your use case to your technology choice.
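One way to read "store the results" is to maintain a counter per known query granularity at write time, so each query becomes a single lookup. The sketch below is only one possible interpretation: the table is an in-memory `Counter`, and the key prefixes (`hour#`, `day#`, `month#`) and function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# In-memory stand-in for a table of precomputed counters.
rollups = Counter()

def rollup_keys(ts):
    """Keys for every granularity the application is known to query."""
    return [
        ts.strftime("hour#%Y-%m-%d-%H"),
        ts.strftime("day#%Y-%m-%d"),
        ts.strftime("month#%Y-%m"),
    ]

def log_event(ts):
    """Write path: bump every rollup counter that covers this event."""
    for key in rollup_keys(ts):
        rollups[key] += 1

for ts in (datetime(2018, 1, 7, 17, 5),
           datetime(2018, 1, 7, 17, 40),
           datetime(2018, 1, 8, 1, 0)):
    log_event(ts)

day_total = rollups["day#2018-01-07"]    # one lookup instead of 24
month_total = rollups["month#2018-01"]   # one lookup instead of hundreds
```

The trade-off is more writes per event in exchange for cheap reads, and it only works for query shapes that are known in advance.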

You can also look at the AWS recommendations for storing time-series data.