Database Normalization

2020-01-26 10:29发布

站内文章 / 前沿技术

29 0

走好不送

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I'm new to database design and I have been reading quite a bit about normalization. If I had three tables: Accommodation, Train Stations and Airports. Would I have address columns in each table or an address table that is referenced by the other tables? Is there such a thing as over-normalization?

Thanks

回答1:

Database Normalization is all about constructing relations (tables) that maintain certain functional dependencies among the facts (columns) within the relation (table) and among the various relations (tables) making up the schema (database). Bit of a mouth-full, but that is what it is all about.

A Simple Guide to Five Normal Forms in Relational Database Theory is the classic reference for normal forms. This paper defines in simple terms what the essence of each normal form is and its significance with respect to database table design. This is a very good "touch-stone" reference.

To answer your specific question properly requires additional information. Some critical questions you have to ask are:

Is an Address a simple fact (e.g. blob of text) or a composite fact (e.g. composed of multiple attributes: Address line, City Name, Postal Code etc.)
What are the other "facts" relating to "Accommodation", "Airport" and "Train Station"?
What sets of "facts" uniquely and minimally identify an "Airport", an "Accommodation" and a "Train Station" (these facts are typically called a key or candidate key)?
What functional dependencies exist among Address facts and the facts composing each relations key?

All this to say, the answer to your question is not as straight forward as one might hope for!

Is there such a thing as "over normalization"? Maybe. This depends on whether the functional dependencies you have identified and used to build your tables are of significance to your application domain.

For example, suppose it was determined that an address was composed of multiple attributes; one of which is postal code. Technically a postal code is a composite item too (at least Canadian Postal Codes are). Further normalizing your database to recognize these facts would probably be an over-normalization. This is because the components of a postal code are irrelevant to your application and therefore factoring them into the database design would be an over-normalization.

回答2:

For addresses, I would almost always create a separate address table. Not only for normalization but also for consistency in fields stored.

As for such a thing as over-normalization, absolutely there is! It's hard to give you guidance on what is and isn't over-normalization as I think it mostly comes from experience. However, follow the books on each level of normalization and then once it starts to get difficult to see where things are you've probably gone too far.

Look at all the sample/example databases you can as well. They will give you a good indication on when you should be splitting out data and when you shouldn't.

Also, be well aware of the type and amount of data you're storing, along with the speed of access, etc. A lot of modern web software is going fully de-normalized for many performance and scalability reason. It's worth looking into those for reason why and when you should and shouldn't de-normalize.

回答3:

Would I have address columns in each table or an address table that is referenced by the other tables?

Can airports, train stations and accommodation each have a different address format?

A single ADDRESS table minimizes the work necessary dealing with addresses - suite, RR, postal/zip code, state/province...

Is there such a thing as over-normalization?

There are different levels of normalization. I've only encountered what I'd consider poor design rather than normalization.

回答4:

Personally I'd go for another table.

I think it makes the design cleaner, makes reporting on addresses much simpler and will make any changes you need to make to the address schema easier.

If you need to have it denormalized later on you can always create two views that contain the Train station and airport information along with any address information you need.

回答5:

If you have a project/piece of functionality that is very performance sensitive, it may be smart to denormalize the database in some cases. However, this can lead to maintenance issues for various reasons. You may instead want to duplicate the data with cache tables but there are drawbacks to this as well. It's really a case by case basis but in normal practice, database normalization is a good thing. 99% of the non-normalized databases I've seen are not by design, but rather by a misunderstanding/mistake by the developer.

回答6:

Would I have address columns in each table or an address table that is referenced by the other tables?

As others have alluded to, this is not really a question of normalization because you're not attempting to reduce redundancy or organize dependencies. Either way is perfectly acceptable. Moving the addresses to a separate table might make sense if you are going to have centralized validation or business logic specific to addresses.

Is there such a thing as over-normalization?

Yes. As has been mentioned, in large systems (lots of data, lots of transactions, or both) you can normalize to the point where performance becomes an issue. This is why lots of systems use denormalized database for reporting and querying.

In addition to performance though, there is also the issue of how easy the data is to query. In systems where there will be a lot of end-user querying of the data (can be dangerous!), a denormalized structure is easier for most non-technical or non-database people to understand.

Like most things we deal with, it's a trade-off between understanding, performance, and future maintainability and there is rarely a clear-cut answer to where you draw the line in any given system.

With experience, you will learn where the line is best drawn for the systems you write.

With that said, my preference is to err on the side of more vs less normalization.

回答7:

This isn't really what I understand by normalisation. You don't seem to be talking about removing redundancy, just how to partition the storage or data model. I'm assuming that the example of addresses for Accommodation, Train Stations and Airports will all be disjoint?

As far as I know it would only be normalisation if you started thinking along the lines. Postcode is functionally dependent upon street address so should be factored out into its own table.

In which case this could be ever desirable or undesirable dependent upon context. Perhaps desirable if you administer the records and can ensure correctness, and less desirable if users can update their own records.

A related question is Is normalizing a person’s name going too far?

回答8:

If you are using Oracle 9i, you could store address objects in your tables. That would remove the (justified) concerns about address formats.

回答9:

I agree with S.Lott, and would like to add:

A good answer depends on what you know already. The basic "math" of relational database theory, however, defines very well-defined, distinct levels of normalization. You cannot normalize anymore when you've reached the ultimate normal form.
Depending on what you want to model with your three entities, and how you identify them, you can come up with very different conceptual data models, all of which can be represented in a mix of normal forms -- or unnormalized at all (like 1 table for all data with descriptors and NULL holes all over the place...). Consider you normalize your three entities to the ultimate normal form. I can now introduce a new requirement, or use case, or extension, which gives an upto-now descriptive attribute a somehow ordered, or referencing, or structured nature if you look at its content. Then, the model should represent this behavior, and what used to be an attribute perhaps will better be a separate entity referenced by other entities.
Over-normalization? Only in the sense that can you normalize a given model so it gets inefficient to store, or process, on a given DB platform. Depending on what can be handled efficiently there, you might want to de-normalize certain aspects, trading off redundancy for speed (data warehouse dbs do this all the time), and insight, or vice versa.

All (working) db designs I've seen so far either have a rather normalized conceptual data model, with quite some denormalization done at the logical and/or physical data model level (speaking in Sybase PowerDesigner terms) to make the model "manageable" -- either that, or they were not working, i.e. failed because the maintenance problems became kingsize real quick.

回答10:

When you say "address", I presume you mean a complete address, like street, city, state/province, maybe country, and zip/postal code. That's 4 or 5 fields, maybe more if you allow for "address line 1" and "address line 2", care-of's, etc. That should definately be in a separate table, with an "addressid" to link to the Station, etc tables. Otherwise, you are creating 3 separate copies of the same set of field definitions. That's bad news because it creates extra effort to keep them consistent. Like, what if initially you are only dealing with U.S. addresses (I'm an American so I'll assume U.S.), but later you find you also need to allow for Canadians. You'll need to expand the size of the postal code field and add a country code. If there's a common table, then you only have to do this once. If there isn't, then you have to do this three times. And it's likely that the "three times" is not just changing the database schema, but changing every place in your programs that processes an address.

One of the benefits of normalization is to minimize the impact of changes.

回答11:

There are times when you want to denormalize to make queries more efficient. But this should be done very cautiously, only after you have good reason to believe that the fully normalized model creates serious inefficiency problems. In my humble experience, most programmers are far to quick to denormalize, usually with a quick "oh, breaking that out into a separate table is too much trouble".

回答12:

I think in this situation it is OK to have address columns in each table. You'll hardly have an address which will be used more than two times. Most of the adresses will be used just one per entity.

But what could be in an extra table are names of streets, cities, countries...

And most important every train station, accomodoation and airport will probably have just one address so it's an n:1 relation.

回答13:

I can only add one more constructive note to the answers already posted here. However you choose to normalize your database, that very process becomes almost trivial when the addresses are standardized (look the same). This is because as you endeavor to prevent duplicates, all the addresses that are actually the same do look the same.

Now, standardizing addresses is not trivial. There are CASS services which do this for you (for US addresses) which have been certified by the USPS. I actually work for SmartyStreets where this is our expertise, so I'd suggest you start your search there. You can either perform batch processing or use the API to standardize the addresses as you receive them.

Without something like this, your database may be normalized, but duplicate address data (whether correct or incomplete and invalid, etc) will still seep in because of the many, many forms they can take. If you have any further questions about this, I'll personally assisty you.