Normalization leads to many essential and desirable characteristics, including aesthetic pleasure. Besides it is also theoretically "correct". In this context, denormalization is applied as a compromise, a correction to achieve performance. Is there any reason other than performance that a database could be denormalized?
相关问题
- NOT DISTINCT query in mySQL
- Flush single app django 1.9
- keeping one connection to DB or opening closing pe
- Mysql-installer showing error : Memoy could not be
- Android Room Fetch data with dynamic table name
相关文章
- Connection pooling vs persist connection mysqli
- Speed up sqlFetch()
- How Do I Seed My Database in the setupBeforeClass
- I set a MySQL column to “NOT NULL” but still I can
- Where in Django can I run startup code that requir
- Google OAuth 2.0 User id datatype for MYSQL
- Storing User Settings - anything wrong with using
- Restore deleted records in PostgreSQL
I take issue with the assertion by folks here that Normalized databases are always associated with simpler, cleaner, more robust code. It is certainly true that there are many cases where fully normalized are associated with simpler code than partially denormalized code, but at best this is a guideline, not a law of physics.
Someone once defined a word as the skin of a living idea. In CS, you could say an object or table is defined in terms of the needs of the problem and the existing infrastructure, instead of being a platonic reflection of an ideal object. In theory, there will be no difference between theory and practice, but in practice, you do find variations from theory. This saying is particularly interesting for CS because one of the focuses of the field is to find these differences and to handle them in the best way possible.
Taking a break from the DB side of things and looking at the coding side of things, object oriented programming has saved us from a lot of the evils of spaghetti-coding, by grouping a lot of closely related code together under an object-class name that has an english meaning that is easy to remember and that somehow fits with all of the code it is associated with. If too much information is clustered together, then you end up with large amounts of complexity within each object and it is reminiscent of spaghetti code. If you make the clusters to small, then you can't follow threads of logic without searching through large numbers of objects with very little information in each object, which has been referred to as "Macaroni code".
If you look at the trade-off between the ideal object size on the programming side of things and the object size that results from normalizing your database, I will give a nod to those that would say it is often better to chose based on the database and then work around that choice in code. Especially because you have the ability in some cases to create objects from joins with hibernate and technologies like that. However I would stop far short of saying this is an absolute rule. Any OR-Mapping layer is written with the idea of making the most complex cases simpler, possibly at the expense of adding complexity to the most simple cases. And remember that complexity is not measured in units of size, but rather in units of complexity. There are all sorts of different systems out there. Some are expected to grow to a few thousand lines of code and stay there forever. Others are meant to be the central portal to a company's data and could theoretically grow in any direction without constraint. Some applications manage data that is read millions of times for every update. Others manage data that is only read for audit and ad-hoc purposes. In general the rules are:
Normalization is almost always a good idea in medium-sized apps or larger when data on both sides of the split can be modified and the potential modifications are independent of each other.
Updating or selecting from a single table is generally simpler than working with multiple tables, however with a well-written OR, this difference can be minimized for a large part of the data model space. Working with straight SQL, this is almost trivial to work around for an individual use case, albeit it in a non-object-oriented way.
Code needs to be kept relatively small to be manage-able and one effective way to do this is to divide the data model and build a service-oriented architecture around the various pieces of the data model. The goal of an optimal state of data (de)normalization should be thought of within the paradigm of your overall complexity management strategy.
In complex object hierarchies there are complexities that you don't see on the database side, like the cascading of updates. If you model relational foreign keys and crosslinks with an object ownership relationship, then when updating the object, you have to decide whether to cascade the update or not. This can be more complex than it would be in sql because of the difference between doing something once and doing something correctly always, sort of like the difference between loading a data file and writing a parser for that type of file. The code that cascades an update or delete in C++, java, or whatever will need to make the decision correctly for a variety of different scenarios, and the consequences of mistakes in this logic can be pretty serious. It remains to be proven that this can never be simplified with a bit of flexibility on the SQL side enough to make any sql complexities worthwhile.
There is also a point deserving delineation with one of the normalization precepts. A central argument for normalization in databases is the idea that data duplication is always bad. This is frequently true, but cannot be followed slavishly, especially when there are different owners for the different pieces of a solution. I saw a situation once in which one group of developers managed a certain type of transactions, and another group of developers supported auditability of these transactions, so the second group of developers wrote a service which scraped several tables whenever a transaction occurred and created a denormalized snapshot record stating, in effect, what was the state of the system at the time the transaction. This scenario stands as an interesting use case (for the data duplication part of the question at least), but it is actually part of a larger category of issues. Data constistency desires will often put certain constraints on the structure of data in the database that can make error handling and troubleshooting simpler by making some of the incorrect cases impossible. However this can also have the impact of "freezing" portions of data because changing that subset of the data would cause past transactions to become invalid under the consistancy rules. Obviously some sort of versioning system is required to sort this out, so the obvious question is whether to use a normalized versioning system (effective and expiration times) or a snapshot-based approach (value as of transaction time). There are several internal structure questions for the normalized version that you don't have to worry about with the snapshot approach, like:
I think the optimal goal here is to learn not only what is correct in theory, but why it is correct, and what are the consequences of violations, then when you are in the real world, you can decide which consequences are worth taking to gain which other benefits. That is the real challenge of design.
Database normalization isn't just for theoretical correctness, it can help to prevent data corruption. I certainly would NOT denormalize for "simplicity" as @aSkywalker suggests. Fixing and cleaning corrupted data is anything but simple.