What is Normalisation (or Normalization)?

Posted 2018-12-31 06:50

Why do database guys go on about normalisation?

What is it? How does it help?

Does it apply to anything outside of databases?

10 answers
妖精总统
#2 · 2018-12-31 06:53

It is intended to reduce redundancy of data.

For a more formal discussion, see the Wikipedia article: http://en.wikipedia.org/wiki/Database_normalization

I'll give a somewhat simplistic example.

Suppose an organization's database often contains several members of the same family:

id   name        address
214  Mr. Chris   123 Main St.
317  Mrs. Chris  123 Main St.

could be normalized as

id   name        familyID
214  Mr. Chris   27
317  Mrs. Chris  27

and a family table

ID   address
27   123 Main St.

Near-complete normalization (BCNF) is usually not what ends up in production; it is an intermediate step. Once you've put the database in BCNF, the next step is usually to denormalize it in a logical way to speed up queries and reduce the complexity of certain common inserts. However, you can't do this well without properly normalizing it first.

The idea is that the redundant information is reduced to a single entry. This is particularly useful for fields like addresses, where Mr. Chris submits his address as Unit-7 123 Main St. and Mrs. Chris lists Suite-7 123 Main Street, which would show up in the original table as two distinct addresses.

Typically, the technique is to find repeated elements, isolate those fields into another table with unique ids, and replace the repeated elements with a foreign key that references the primary key of the new table.
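As a rough illustration of that technique, here is a minimal Python sketch that works on in-memory rows rather than a real database. The function name split_out_addresses and the generated family ids (1, 2, ...) are made up for this example; the sample rows are the ones from the tables above.

```python
def split_out_addresses(people):
    """Split (id, name, address) rows into member rows and a family table.

    Each distinct address is stored once and gets a family id;
    member rows keep only that id instead of the address text.
    """
    family_ids = {}  # address -> family id (hypothetical surrogate key)
    members = []
    for person_id, name, address in people:
        if address not in family_ids:
            family_ids[address] = len(family_ids) + 1
        members.append((person_id, name, family_ids[address]))
    families = [(fid, address) for address, fid in family_ids.items()]
    return members, families


people = [
    (214, "Mr. Chris", "123 Main St."),
    (317, "Mrs. Chris", "123 Main St."),
]
members, families = split_out_addresses(people)
print(members)   # [(214, 'Mr. Chris', 1), (317, 'Mrs. Chris', 1)]
print(families)  # [(1, '123 Main St.')]
```

Note that this only merges addresses that are spelled identically; variants such as "Unit-7" versus "Suite-7" would still need cleaning up before they collapse into one family row.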

君临天下
#3 · 2018-12-31 06:56

Normalization is a procedure used to eliminate redundancy and unwanted functional dependencies between columns in a table.

There exist several normal forms, generally indicated by a number. A higher number means fewer redundancies and dependencies. Any SQL table is in 1NF (first normal form), pretty much by definition. Normalizing means changing the schema (often by splitting tables) in a reversible way, giving a model which is functionally identical, except with less redundancy and fewer dependencies.

Redundancy and dependency of data are undesirable because they can lead to inconsistencies when the data is modified.

怪性笑人.
#4 · 2018-12-31 07:00

Most importantly, it serves to remove duplication from the database records. For example, if the name of a person could appear in more than one place (table), you move the name to a separate table and reference it everywhere else. This way, if you need to change the person's name later, you only have to change it in one place.

It is crucial for proper database design, and in theory you should use it as much as possible to preserve data integrity. However, retrieving information from many tables costs some performance, which is why you sometimes see denormalised (also called flattened) database tables in performance-critical applications.

My advice is to start with a good degree of normalisation and only denormalise when really needed.

P.S. Also check this article to read more on the subject and about the so-called normal forms: http://en.wikipedia.org/wiki/Database_normalization

怪性笑人.
#5 · 2018-12-31 07:01

It helps prevent duplicate (and worse, conflicting) data.

It can have a negative impact on performance, though.

春风洒进眼中
#6 · 2018-12-31 07:06

Quoting CJ Date: Theory IS practical.

Departures from normalization will result in certain anomalies in your database.

Departures from First Normal Form will cause access anomalies, meaning that you have to decompose and scan individual values in order to find what you are looking for. For example, if one of the values is the string "Ford, Cadillac" as given by an earlier response, and you are looking for all the occurrences of "Ford", you are going to have to break open the string and look at the substrings. This, to some extent, defeats the purpose of storing the data in a relational database.

The definition of First Normal Form has changed since 1970, but those differences need not concern you for now. If you design your SQL tables using the relational data model, your tables will automatically be in 1NF.

Departures from Second Normal Form and beyond will cause update anomalies, because the same fact is stored in more than one place. These problems can make it impossible to store some facts without also storing other facts that may not exist, and therefore have to be invented. Or, when the facts change, you may have to locate all the places where a fact is stored and update all of them, lest you end up with a database that contradicts itself. And when you go to delete a row from the database, you may find that you are deleting the only place where a still-needed fact is stored.
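To make the update and deletion cases concrete, here is a small sketch using Python's built-in sqlite3 module. The user_car table, its columns, and the sample rows are hypothetical, chosen only to show one fact (a user's name) stored in more than one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Unnormalized on purpose: the user's name is repeated on every car row.
conn.execute("CREATE TABLE user_car (user_id INTEGER, user_name TEXT, car TEXT)")
conn.executemany(
    "INSERT INTO user_car VALUES (?, ?, ?)",
    [(1, "John", "Toyota"), (2, "Sue", "Ford"), (2, "Sue", "Cadillac")],
)

# Update anomaly: the rename touches only one of the two rows for user 2,
# so the database now holds two different names for the same user.
conn.execute(
    "UPDATE user_car SET user_name = 'Susan' WHERE user_id = 2 AND car = 'Ford'"
)
names_for_user_2 = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT user_name FROM user_car WHERE user_id = 2"
    )
)
print(names_for_user_2)  # ['Sue', 'Susan']

# Deletion anomaly: removing John's only car also removes the only
# record that user 1 is called John.
conn.execute("DELETE FROM user_car WHERE user_id = 1")
john_rows = conn.execute(
    "SELECT COUNT(*) FROM user_car WHERE user_name = 'John'"
).fetchone()[0]
print(john_rows)  # 0
```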

These are logical problems, not performance problems or space problems. Sometimes you can get around these update anomalies by careful programming. Sometimes (often) it's better to prevent the problems in the first place by adhering to normal forms.

Notwithstanding the value in what's already been said, it should be mentioned that normalization is a bottom-up approach, not a top-down approach. If you follow certain methodologies in your analysis of the data, and in your initial design, you can be guaranteed that the design will conform to 3NF at the very least. In many cases, the design will be fully normalized.

Where you may really want to apply the concepts taught under normalization is when you are given legacy data, out of a legacy database or out of files made up of records, and the data was designed in complete ignorance of normal forms and the consequences of departing from them. In these cases you may need to discover the departures from normalization, and correct the design.

Warning: normalization is often taught with religious overtones, as if every departure from full normalization is a sin, an offense against Codd (little pun there). Don't buy that. When you really, really learn database design, you'll not only know how to follow the rules, but also know when it's safe to break them.

姐姐魅力值爆表
#7 · 2018-12-31 07:08

Normalization basically means designing a database schema such that duplicate and redundant data is avoided. If some piece of data is duplicated in several places in the database, there is a risk that it is updated in one place but not the other, leading to data corruption.

There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of some specific problem, usually related to redundancy.

Some typical normalization errors:

(1) Having more than one value in a cell. Example:

UserId | Car
---------------------
1      | Toyota
2      | Ford,Cadillac

Here the "Car" column (which is a string) has several values in one cell. That violates the first normal form, which says that each cell should hold only one value. We can normalize this problem away by having a separate row per car:

UserId | Car
---------------------
1      | Toyota
2      | Ford
2      | Cadillac

The problem with having several values in one cell is that it is tricky to update, tricky to query against, and you cannot apply indexes, constraints and so on.
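As a minimal sketch of that repair in Python, the multi-valued cells can be split into one row per car, after which a query becomes a plain equality test. The function name to_one_value_per_cell is made up for this example, and it assumes the values are separated by commas as in the table above.

```python
def to_one_value_per_cell(rows):
    """Turn (user_id, "Ford,Cadillac") into one (user_id, car) row per car."""
    return [
        (user_id, car.strip())
        for user_id, cars in rows
        for car in cars.split(",")
    ]


rows = [(1, "Toyota"), (2, "Ford,Cadillac")]
normalized = to_one_value_per_cell(rows)
print(normalized)  # [(1, 'Toyota'), (2, 'Ford'), (2, 'Cadillac')]

# With one value per cell, finding Ford owners needs no substring parsing.
ford_owners = [user_id for user_id, car in normalized if car == "Ford"]
print(ford_owners)  # [2]
```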

(2) Having redundant non-key data (i.e. data repeated unnecessarily in several rows). Example:

UserId | UserName | Car
-----------------------
1      | John     | Toyota
2      | Sue      | Ford
2      | Sue      | Cadillac

This design is a problem because the name is repeated in every row for a user, even though the name is always determined by the UserId. This makes it possible to change Sue's name in one row and not the other, which is data corruption. The problem is solved by splitting the table in two and creating a primary key/foreign key relationship:

UserId(FK) | Car               UserId(PK) | UserName
---------------------          -----------------
1          | Toyota            1          | John
2          | Ford              2          | Sue
2          | Cadillac

Now it may seem like we still have redundant data because the UserIds are repeated; however, the PK/FK constraint ensures that the values cannot be updated independently, so integrity is safe.
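Here is a small sketch of that split using Python's built-in sqlite3 module. The table names users and cars are assumptions for the example; note also that SQLite only enforces foreign keys when the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.execute(
    "CREATE TABLE users (user_id INTEGER PRIMARY KEY, user_name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE cars ("
    " user_id INTEGER NOT NULL REFERENCES users(user_id),"
    " car TEXT NOT NULL)"
)
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "John"), (2, "Sue")])
conn.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [(1, "Toyota"), (2, "Ford"), (2, "Cadillac")],
)

# The name lives in exactly one row, so one UPDATE renames the user everywhere.
conn.execute("UPDATE users SET user_name = 'Susan' WHERE user_id = 2")
owners = conn.execute(
    "SELECT u.user_name, c.car FROM cars c"
    " JOIN users u ON u.user_id = c.user_id"
    " WHERE c.user_id = 2 ORDER BY c.car"
).fetchall()
print(owners)  # [('Susan', 'Cadillac'), ('Susan', 'Ford')]

# The foreign key rejects a car row that points at a user who does not exist.
try:
    conn.execute("INSERT INTO cars VALUES (3, 'Honda')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```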

Is it important? Yes, it is very important. By having a database with normalization errors, you run the risk of getting invalid or corrupt data into the database. Since data "lives forever", it is very hard to get rid of corrupt data once it has entered the database.

Don't be scared of normalization. The official technical definitions of the normalization levels are quite abstruse, which makes normalization sound like a complicated mathematical process. However, normalization is basically just common sense, and you will find that if you design a database schema using common sense, it will typically be fully normalized.

There are a number of misconceptions around normalization:

  • Some believe that normalized databases are slower and that denormalization improves performance. This is only true in very special cases, however. Typically a normalized database is also the fastest.

  • Sometimes normalization is described as a gradual design process in which you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problems solved by normal forms above 3NF are pretty rare in the first place, so chances are that your schema is already in 5NF.

Does it apply to anything outside of databases? Not directly, no. The principles of normalization are quite specific to relational databases. However, the general underlying theme - that you shouldn't have duplicate data if the different instances can get out of sync - can be applied broadly. This is basically the DRY principle.
