I'd like to move from the SQL approach to the Key Value approach, because I deal with "big data" and would like to benefit from systems like DynamoDB, Riak or Cassandra.
It's quite easy when the data is unrelated, so one can take a document-based approach (a primary key + data, but no relations).
I'd appreciate any theoretical or academic input on how to model my data.
I've been using NoSQL for the last 4 years, and this is just what I think and what I've learnt: my personal golden rules.
Premise: in the SQL world, any possible relation between data, and any problem or situation you have to deal with, often comes with a precise answer, thanks both to the age and to the "uniqueness" of the product. People coming from this "perfect world" try to look at NoSQL in the same way, but here any problem can have many solutions (or no solution), depending both on the needs of the application and on the product you're using.
Think about queries before writing the model. The term "query-oriented" really fits the context. Go deep with the analysis: the more you know about how you'll query your data, the better the result will be.
Denormalize. Don't think "a table owns certain data" but rather "a table answers a few queries", so your data (or different subsets of your data) might be repeated in different tables. This is the norm, and a way to avoid joins and relations.
This is implicitly an extension of the first two: don't think "the fewer tables, the better the design". The more queries you have, the more tables you will probably need.
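To make the "one table per query" idea concrete, here is a minimal sketch in plain Python. The two dictionaries are only in-memory stand-ins for key-value tables, and the names (`orders_by_customer`, `orders_by_product`, `save_order`) and the order fields are hypothetical, not taken from any particular product.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two key-value "tables".
# Each one exists to answer exactly one query.
orders_by_customer = defaultdict(list)  # "all orders placed by customer X"
orders_by_product = defaultdict(list)   # "all orders containing product Y"


def save_order(order_id, customer_id, product_id, amount):
    """Write the same order into every table that has to answer a query."""
    record = {
        "order_id": order_id,
        "customer_id": customer_id,
        "product_id": product_id,
        "amount": amount,
    }
    # The record is duplicated on purpose: each table stores its own copy.
    orders_by_customer[customer_id].append(dict(record))
    orders_by_product[product_id].append(dict(record))


save_order("o1", "alice", "book", 12.0)
save_order("o2", "alice", "pen", 2.5)
save_order("o3", "bob", "book", 12.0)

# Each query is a single key lookup, with no join.
alice_orders = [r["order_id"] for r in orders_by_customer["alice"]]
book_buyers = [r["customer_id"] for r in orders_by_product["book"]]
```

The price of this design is paid at write time: every new query you need to serve may mean one more table to keep up to date.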
Study your product. Each system offers different features: some will offer you "data sorting" for free, some others may offer collections, callbacks, triggers and so on, so the model could be quite different from one product to another.
Deal with your needs and possibilities. Sometimes you will have to choose, for instance, between creating a new table with the data sorted differently and sorting your data client side. There is no single correct answer. If you have little disk space, or the data to be sorted are small sets, you might choose one way (sorting client side); if you have little "computing power", you'd better choose the other (the extra table).
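The trade-off between the two options can be sketched in plain Python. This is only an illustration under simplifying assumptions: two lists play the role of tables, and `write_event`, `read_sorted_client_side` and `read_presorted` are hypothetical names.

```python
import bisect

events = []          # single unsorted "table": no extra space, sort on every read
events_by_time = []  # second "table" kept sorted: extra space, cheap reads


def write_event(timestamp, name):
    """Store the event in both tables; the second one stays ordered."""
    events.append((timestamp, name))
    bisect.insort(events_by_time, (timestamp, name))


def read_sorted_client_side():
    """Option A: spend computing power at read time."""
    return sorted(events)


def read_presorted():
    """Option B: spend storage, read the data already in order."""
    return list(events_by_time)


write_event(30, "logout")
write_event(10, "login")
write_event(20, "purchase")

client_side = read_sorted_client_side()
presorted = read_presorted()
```

Both reads return the same ordered data; what changes is where the cost lands, which is why the choice depends on the resources you actually have.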
Remember that NoSQL doesn't mean "No SQL" but "Not Only SQL". You can also imagine your schema as a hybrid (I think that https://mariadb.org/ offers this kind of solution), or remember that you can put a layer of Hive/Shark/Pig on top to perform more complex "backend queries".
If you choose Cassandra, once you have studied the product a little, have a look at these:
- Become a super modeler
- Datastax data modelling example
HTH,
Carlo