homogeneous vs heterogeneous in documentdb

2020-02-09 17:39发布

问题:

I am using Azure DocumentDB and all my experience in NoSql has been in MongoDb. I looked at the pricing model and the cost is per collection. In MongoDb I would have created 3 collections for what I was using: Users, Firms, and Emails. I noted that this approach would cost $24 per collection per month.

I was told by the people I work with that I'm doing it wrong. I should have all three of those things stored in a single collection with a field to describe what the data type is. That each collection should be related by date or geographic area so one part of the world has a smaller portion to search. and to:

"Combine different types of documents into a single collection and add a field across all to separate them in searching like a type field or something"

I would never have dreamed of doing that in Mongo, as it would make indexing, shard keys, and other things hard to get right.

There might not be may fields that overlap between the objects (example: Email and firm objects)

I can do it this way, but I can't seem to find a single example of anyone else doing it that way - which indicates to me that maybe it isn't right. Now, I don't need an example, but can someone point me to some location that describes which is the 'right' way to do it? Or, if you do create a single collection for all data - other than Azure's pricing model, what are the advantages / disadvantages in doing that?

Any good articles on DocumentDb schema design?

回答1:

Yes. In order to leverage CosmosDb to it's full potential need to think of a Collection is an entire Database system and not as a "table" designed to hold only one type of object.

Sharding in Cosmos is exceedingly simply. You just specify a field that all of your documents will populate and select that as your partition key. If you just select a generic value such as key or partitionKey you can easily separate the storage of your inbound emails, from users, from anything else by picking appropriate values.

class InboundEmail
{
   public string Key {get; set;} = "EmailsPartition";
   // other properties
}

class User
{
   public string Key {get; set;} = "UsersPartition";
   // other properties
}

What I'm showing is still only an example though. In reality your partition key values should be even more dynamic. It's important to understand that queries against a known partition are extremely quick. As soon as you need to scan across multiple partitions you'll see much slower and more costly results.

So, in an app that ingests a lot of user data. Keeping a single user's activity together in one partition might make sense for that particular entity.

If you want evidence that this is the appropriate way to use CosmosDb, consider the addition of the new Gremlin Graph APIs. Graphs are inherently heterogenous as they contain many different entities and entity types as well as the relationships between them. The query boundary of Cosmos is at the collection level so if you tried putting your entities all in different collections none of the Graph API or queries would work.

EDIT: I noticed in the comments you made this statement And you would have an index on every field in both objects. CosmosDb does automatically index every field of every document. They use a special proprietary path based indexing mechanism that ensures every path of your JSON tree has indices on it. You have to specifically opt out of this auto indexing feature.