How do I know which is the more appropriate database design?

Published 2019-02-21 00:38

Question:

Let's assume a database with three tables: Author, Articles, Comments

Assuming the relationship is as follows:

Author has many Articles
Article belongs to one Author
Article has many Comments
Comment belongs to one Article

If I want to know which Author writes the most-commented articles, I first need to select all Articles that belong to a specific Author, and then count the number of comments that were posted under each of those Articles. In general, this leads to more complex queries.
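For illustration, a minimal sketch of counting comments per Author under this schema; the table and column names (authors, articles.author_id, comments.article_id) are assumptions, not taken from the question:

-- count comments per author by joining through articles (normalized schema)
select au.id, au.name, count(co.id) as comment_count
from authors au
join articles ar on ar.author_id = au.id
join comments co on co.article_id = ar.id
group by au.id, au.name
order by comment_count desc;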

If the relationships were instead as follows:

Author has many Articles
Article belongs to one Author
Article has many Comments
Comment belongs to one Article
**Comment belongs to one Author of the relevant Article**

then I could directly select and count all comments that were posted under the articles of a specific Author, without bothering about including the articles in the query.
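Under the second schema, a sketch of the same count, assuming a redundant comments.author_id column that duplicates the article's Author:

-- the redundant comments.author_id removes articles from the query entirely
select au.id, au.name, count(co.id) as comment_count
from authors au
join comments co on co.author_id = au.id
group by au.id, au.name
order by comment_count desc;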

But it implies a redundant relationship.

In view of performance, usability and coding best practices, which approach is the better one?

I remember reading somewhere that one should only use the first approach and avoid redundant relationships, but I don't remember where or why. Is there a link to a scientific approach to answering this question?

Answer 1:

Your first approach is a normalized design. It should be the default - it's more maintainable, less error-prone, and requires less code overall.

The second option is a denormalized design. If you think it through, it would require you to find the author of the article every time someone posts a comment, and to increment a "comments" field; that's probably more code, and it makes writing the comment slower. It also means a simple bug in your "create comment" code could break the application logic, and you probably need to wrap each comment "write" in a transaction so you can guarantee that the comment insert and the update to "authors.comment_count" succeed or fail together.
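A hedged sketch of that write path; the authors.comment_count column comes from this answer's example, and the other names are assumptions carried over from the sketches above:

-- denormalized write: the insert and the counter update must succeed
-- or fail together, hence the explicit transaction
begin;

insert into comments (article_id, author_id, body)
values (42, 7, 'Nice article!');

update authors
set comment_count = comment_count + 1
where id = (select author_id from articles where id = 42);

commit;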

So, the second option is definitely more complex, and slower for writing comments. It may be faster for querying, but as you'll be joining on primary keys, you will almost certainly not be able to measure that performance impact until you get to a database size of hundreds of millions of records.

In general, I recommend the following approach; take each step only if the previous steps haven't given you enough performance.

  • design a relational model.
  • tune that relational database (indexes, etc.; see the index sketch after this list)
  • improve the hardware - RAM, CPU, SSD disks etc.
  • create a measurement rig so you can identify the performance challenges and run experiments. Create benchmarks based on current and expected data sizes; find a way to fill your test rig with dummy data until you have the data volume you need to scale to.
  • run your queries on the test rig. Make sure there are no further performance tweaks from indexing or query optimization.
  • introduce application-level caching. In your example, caching the number of comments for an author for 1 hour may be acceptable.
  • de-normalize your schema. Use your test rig to prove it gives you the performance you expect.
  • look at more exotic data solutions - sharding, data partitioning etc.
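For the index-tuning step, a minimal sketch of the indexes that would support the comment-counting query from the question (names are assumptions carried over from the earlier sketches):

-- index the foreign-key columns used by the joins; for this workload,
-- tuning the normalized schema largely comes down to indexes like these
create index idx_articles_author_id on articles (author_id);
create index idx_comments_article_id on comments (article_id);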

Denormalization is so far down the line because it introduces real maintenance risks, makes your code much more complex, and is nowhere near as effective as adding an extra 4GB to your server in most cases.



Answer 2:

"But I don't remember where or why? Please link to a scientific approach to answer this question."

The "scientific approach" is the entire body of normalization theory.

The "redundant relationship" creates an additional problem in integrity enforcement. The system must make sure that the comment/author relationship as specified by a user updating the db, is the same as the one implied by the comment/article and article/author relationships.

That adds complexity for the system when enforcing data integrity, and for the users doing the updating, who must take care not to specify invalid updates.
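If you do keep the redundant relationship, one declarative way to make the DBMS enforce that consistency is a composite foreign key; this is a sketch using the assumed names from the earlier examples, not something from the answer itself:

-- a unique constraint on (id, author_id) lets other tables reference
-- the pair, even though id alone is already the primary key
alter table articles
add constraint uq_articles_id_author unique (id, author_id);

-- the redundant comments.author_id must now match the article's real
-- author; any inconsistent pair is rejected by the DBMS
alter table comments
add constraint fk_comments_article_author
foreign key (article_id, author_id)
references articles (id, author_id);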

So your "second approach" might make querying "simpler" indeed, but only at the expense of creating additional complexities on the "updating" side.



Answer 3:

Tables represent business/application relation(ship)s/associations, as in the relational model & entity-relationship modeling. Every query result holds the rows of values that are related by some business relationship expressed by the query expression.

Your "relationships" [sic] are FKs (foreign keys). Those are constraints--statements true in every business situation & its database state--saying that if some values are related by a certain business relationship then they are also related by a certain other one. But FKs are neither necessary nor sufficient for using the database--for interpreting it or updating it. They constrain the database state, but they don't tell you what's in it.

Your business relationships & corresponding tables are actually like:

Author authored Article
Commenter commented Comment re Article

Such a statement template denoting a business relationship is its (characteristic) predicate. To query using these it does not matter what the constraints are--if you want the authors who commented on articles authored by themselves that's

/* rows where
FOR SOME a.* & cr.*,
        Author = a.Author
    AND a.Author authored a.Article
    AND cr.Commenter commented cr.Comment re cr.Article
    AND a.Author = cr.Commenter
    AND a.Article = cr.Article
*/
select Author
from authored a join commented_re cr
on a.Author = cr.Commenter and a.Article = cr.Article

regardless of whether an author can author multiple articles, or multiple authors can author an article, or multiple authors can author multiple articles, or commenters can comment re multiple comments, etc, or commenters can comment re multiple articles, etc, or a comment can be re multiple articles, etc, or authors can comment, or commenters can author, or commenters can only comment on articles they authored (a FK constraint) or authors named 'henk' can comment re at most 7 articles, or any constraint whatsoever.

Normalization replaces a table by selects of it that join back to it, which is the same as saying it replaces a business relationship expressible via an AND by others expressible by the expressions that were ANDed. It happens that if an author can only write one article and an article can only be written by one author, then the AND/join table above might (depending on other things) be a good design; otherwise it would not be, and it should be replaced by the separate tables. FDs & other constraints are the post-design, table-based expression of corresponding business rules that follow from the chosen business relationships & what business situations can arise.
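As a sketch of that idea with assumed names: suppose a single table authored_commented(Author, Article, Comment) holding rows where Author authored Article AND Comment was posted re Article. Given the rule that each Article has exactly one Author, the table decomposes losslessly into the two business relationships, and joining them back reconstructs it:

-- hypothetical decomposition, expressed as views over the combined table
create view authored as
select distinct Author, Article from authored_commented;

create view comment_re as
select distinct Comment, Article from authored_commented;

-- because Article determines Author, this join returns exactly the
-- rows of authored_commented
select a.Author, a.Article, c.Comment
from authored a join comment_re c on a.Article = c.Article;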

So your "scientific approach" is proper relational information modeling and database design, including normalization.