MongoDB beginner - to normalize or not to normaliz

2019-04-26 17:14发布

问题:

I'm going to try and make this as straight-forward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:

Let's say that we have a real-estate website and we're displaying a list of houses
normally, I'd use the following tables:

  • houses - the real estate asset at hand
  • owners - the owner of the house (one-to-many relationship with houses)
  • agencies - the real-estate broker agency (many-to-many relationship with houses)
  • images - many-to-one relationship with houses
  • reviews - many-to-one relationship with houses

I understand that MongoDB gives you the flexibility to design your web-app in different collections with unique IDs much like a relational database (normalized), and to enjoy quick selections, you can nest within a collection, related objects and data (un-normalized).

Back to our real-estate houses list, the query used to populate it is quite expensive in a normal relational DB, for each house you need to query its images, reviews, owner & agencies, each entity resides in a different table with its fields, you'd probably use joins and have multiple queries joined into one - Expensive!

Enter MongoDB - where you don't need joins, and you can store all the related data of a house in a house item on the houses collection, selection was never faster, it's a db heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?

This is a mystery to me, and if I need to guess, each related collection exist on its own collection on top of its data within the houses table, and once one of these pieces of related data is being added/updated/deleted you'll have to update it on its own collection as well as on the houses collection. Upon this update - do I need to query the other collections as well to make sure I'm updating the house record with all the updated related data?
I'm just guessing here and would really appreciate your feedback.

Thanks,
Ajar

回答1:

Try this approach:

Work out which entity (or entities) are the hero(s)

With 'hero', I mean the entity(s) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.

Work out the ownerships

Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.

Work out whether it actually matters that data is de-normalised

You will have agency (and probably owner) details spread across multiple houses. Does that matter?

Your house collection will probably look like this:

house: {
owner,
agency,
images[], // recommend references to GridFS here
reviews[] // you probably won't get too many of these for a single house
}

*Actually, it's probably the ad of the house (since houses are typically advertised on a real-estate website and that's probably what you're really interested in) so just consider that



回答2:

Sarah Mei wrote an informative article about the kinds of issues that can arise with data integrity in nosql dbs. The choice between duplicate data or using id's, code based joins and the challenges with keeping data integrity. Her take is that any nosql db with code based joins will lose data integrity at some point. Imho the articles comments are as valuable as the article itself in understanding these issues and possible resolutions.

Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/



回答3:

I would just like to give a normalization refresher from the MongoDB's perspective -

What are the goals of normalization?

  • Frees the database from modification anomalies - For MongoDB, it looks like embedding data would mostly cause this. And in fact, we should try to avoid embedding data in documents in MongoDB which possibly create these anomalies. Occasionally, we might need to duplicate data in the documents for performance reasons. However that's not the default approach. The default is to avoid it.
  • Should minimize re-design when extending - MongoDB is flexible enough because it allows addition of keys without re-designing all the documents
  • Avoid bias toward any particular access pattern - this is something, we're not going to worry about when describing schema in MongoDB. And one of the ideas behind the MongoDB is to tune up your database to the applications that we're trying to write and the problem we're trying to solve.