Background
I'm prototyping a conversion from our RDBMS database to MongoDB. While denormalizing, it seems as if I have two choices, one which leads to many (millions) of smaller documents or one which leads to fewer (hundreds of thousands) large documents.
If I could distill it down to a simple analog, it would be the difference between a collection with fewer Customer documents like this (in Java):
class Customer {
private String name;
private Address address;
// each CreditCard has hundreds of Payment instances
private Set<CreditCard> creditCards;
}
or a collection with many, many Payment documents like this:
class Payment {
private Customer customer;
private CreditCard creditCard;
private Date payDate;
private float payAmount;
}
Question
Is MongoDB designed to prefer many, many small documents or fewer large documents? Does the answer mostly depend on what queries I plan on running? (i.e. How many credit cards does customer X have? vs What was the average amount all customers paid last month?)
I've looked around a lot but I didn't stumble into any MongoDB schema best practices that would help me answer my question.
You'll definitely need to optimize for the queries you're doing.
Here's my best guess based on your description.
You'll probably want to know all Credit Cards for each Customer, so keep an array of those within the Customer Object. You'll also probably want to have a Customer reference for each Payment. This will keep the Payment document relatively small.
The Payment object will automatically have its own ID and index. You'll probably want to add an index on the Customer reference as well.
This will allow you to quickly search for Payments by Customer without storing the whole customer object every time.
If you want to answer questions like "What was the average amount all customers paid last month" you're instead going to want a map / reduce for any sizeable dataset. You're not getting this response "real-time". You'll find that storing a "reference" to Customer is probably good enough for these map-reduces.
So to answer your question directly: Is MongoDB designed to prefer many, many small documents or fewer large documents?
MongoDB is designed to find indexed entries very quickly. MongoDB is very good at finding a few needles in a large haystack. MongoDB is not very good at finding most of the needles in the haystack. So build your data around your most common use cases and write map/reduce jobs for the rarer use cases.
According to MongoDB's own documentation, it sounds like it's designed for many small documents.
From Performance Best Practices for MongoDB:
The maximum size for documents in MongoDB is 16 MB. In practice most
documents are a few kilobytes or less. Consider documents more like
rows in a table than the tables themselves. Rather than maintaining
lists of records in a single document, instead make each record a
document.
From 6 Rules of Thumb for MongoDB Schema Design: Part 1:
Modeling One-to-Few
An example of “one-to-few” might be the addresses for a person. This
is a good use case for embedding – you’d put the addresses in an array
inside of your Person object.
One-to-Many
An example of “one-to-many” might be parts for a product in a
replacement parts ordering system. Each product may have up to several
hundred replacement parts, but never more than a couple thousand or
so. This is a good use case for referencing – you’d put the ObjectIDs of
the parts in an array in product document.
One-to-Squillions
An example of “one-to-squillions” might be an event logging system
that collects log messages for different machines. Any given host
could generate enough messages to overflow the 16 MB document size,
even if all you stored in the array was the ObjectID. This is the
classic use case for “parent-referencing” – you’d have a document for
the host, and then store the ObjectID of the host in the documents for
the log messages.
Documents that grow substantially over time can be ticking time bombs. Network bandwidth and RAM usage will likely become measurable bottlenecks, forcing you to start over.
First, let's consider two collections: Customer and Payment. Thus, the grain is fairly small: one document per payment.
Next you must decide how to model account information, such as credit cards. Let's consider whether customer documents contain arrays of account information or whether you need a new Account collection.
If account documents are separate from customer documents, loading all of the accounts for one customer into memory requires fetching multiple documents. That might translate into extra memory, I/O, bandwidth, and CPU usage. Does that immediately mean the Account collection is a bad idea?
Your decision affects payment documents. If account information is embedded in a customer document, how would you reference it? Separate account documents have their own _id attribute. With embedded account information, your application would either generate new ids for accounts or use the account's attributes (e.g., account number) for the key.
Could a payment document actually contain all the payments made in fixed timeframe (e.g., day?). Such complexity will affect all code that reads and writes payment documents. Premature optimization can be deadly to projects.
Like account documents, payments are easily referenced as long as a payment document contains only one payment. A new type of document, credit for example, could reference a payment. But would you create a Credit collection or would you embed credit information inside payment information? What would happen if you later needed to reference a credit?
To summarize, I have been successful with lots of small documents and many collections. I implement references with _id and only with _id. Thus, I don't worry about ever-growing documents destroying my application. The schema is easy to understand and index because each entity has its own collection. Important entities aren't hiding inside other documents.
I'd love to hear about your findings. Good luck!