Is it worth breaking out address information into

2020-05-19 05:37发布

I have a table called "Person" with the following fields

  • Id (Primary Key)
  • FirstName
  • LastName
  • DateOfBirth
  • City
  • State
  • Country

Should things like City, or State or Country be normalized and broken up into their own table and then this table have CityId and StateId columns. We were having a debate whether this was a good or bad decision.

To add, I do have a City and a State table (for other reasons not related to this person table). I am curious around answers with or without this additional fact.

10条回答
Melony?
2楼-- · 2020-05-19 06:12

I would consider breaking out the city, state and country into a single 'address (or city)' table which contains the state and country replicated across the rows. For the number of unique cities in the world, this is not a real database query cost.

It also depends on the number of records you are expecting to have - if the total person count will always be less than say 100,000 then is it really worth the effort to normalize the data?

Having a flat data structure makes queries and testing so much simpler, so unless there is a performance or disk space problem, then maybe it is best to 'keep it simple'.

查看更多
一纸荒年 Trace。
3楼-- · 2020-05-19 06:16
  • If this would be a relatively small database and you plan to let the user input the address himself, you should let the table as it is. Even though that will increase the table size(in bytes), because of the duplicate storage of the city, state and country names.

  • If this would be a relatively huge database and you want the user to select the city, state and country names from a list, then you need to separate this columns to another table. Also, for this to work, you have to populate this table yourself. The advantages would be a smaller table for users and addresses for too.

查看更多
来,给爷笑一个
4楼-- · 2020-05-19 06:21

The problem with {country,state,city} is that they appear to be a candidate key for the referenced table. In SQL, {country,state,city} cannot be a candidate key (or even a primary key), if state (or country) can be missing or NULL. (this could be avoided by allowing an empty string for them, which is different from NULL, but that would be an ugly hack, IMO) The same would go for zipcode, which could only be made into a candidate key by adding country to it. And both could be missing, unknown or NULL.

The only way around the crippled candidate keys would be to demote them to (non-unique) indexes, and add a surrogate primary key, like in:

CREATE TABLE cities
    ( city_id INTEGER NOT NULL PRIMARY KEY -- could be a serial ...
    , country_name varchar -- you _could_ squeeze this out into a separate "countries" table
    , state_name varchar   -- you could even squeeze this out, but it would need a composite FK
    , city_name varchar NOT NULL
    );

CREATE TABLE adresses
    ( person_id INTEGER NOT NULL PRIMARY KEY -- could be a serial
    , last_name varchar NOT NULL
    , first_first_name varchar
    , gender CHAR(1)
    , dob DATE
    , city_id INTEGER references cities(city_id) -- could be NOT NULL
    );

WRT {city,state} : you could squeeze these out into a junction table (this basically is a BCNF problem, maybe even a 4NF problem, if all the join fields were non-NULLABLE) like in:

    --
    -- Plan B:
    --
CREATE TABLE country2
    ( country_id INTEGER NOT NULL PRIMARY KEY -- could be a serial ...
    , country_name varchar NOT NULL
    , country_iso varchar
    -- ...
    , UNIQUE (country_name)
    );

CREATE TABLE country_state2
    ( cs_id INTEGER NOT NULL PRIMARY KEY -- could be a serial ...
    , country_id INTEGER NOT NULL REFERENCES country2(country_id)
    , state_name varchar
    );
CREATE TABLE cities2
    ( city_id INTEGER NOT NULL PRIMARY KEY -- could be a serial ...
    , cs_id INTEGER REFERENCES country_state2(cs_id)
    , city_name varchar NOT NULL
    );

CREATE TABLE adresses2
    ( person_id INTEGER NOT NULL PRIMARY KEY -- could be a serial
    , last_name varchar NOT NULL
    , first_first_name varchar
    , gender CHAR(1)
    , dob DATE
    , city_id INTEGER references cities2(city_id) -- could be NOT NULL
    );

Whether you should actually do this is a matter of taste (see @Joel Brown's answer). The normallisation would certainly help in case of massive renaming operations, like the merging of municipalities in the OQ. For small sets of addresses (upto maybe a few thousand), the extra complexity would probably cost more than it would gain. This complexity is particularly costly for the front-end applications used to maintain the data. For the DBMS, a few joins would not cost that much (for small sizes) and could even help performance (for larger sizes). Normalisation is not bad for performance.

UPDATE (after Mike Sherill catcall's comment):

If we could impose NOT NULL constraints on {country,state,city} (or there ids), we could also impose UNIQUE constraints on the (composite) candidate keys that they are part of: -- -- Plan C: -- CREATE TABLE country3 ( country_id INTEGER NOT NULL PRIMARY KEY -- could be a serial ... , country_name varchar NOT NULL , country_iso varchar , UNIQUE (country_name) );

CREATE TABLE country_state3
    ( cs_id INTEGER NOT NULL PRIMARY KEY -- could be a serial ...
    , country_id INTEGER NOT NULL REFERENCES country3(country_id)
    , state_name varchar NOT NULL
    , UNIQUE (country_id,state_name)
    );
CREATE TABLE cities3
    ( city_id INTEGER NOT NULL PRIMARY KEY -- could be a serial ...
    , cs_id INTEGER NOT NULL REFERENCES country_state3(cs_id)
    , city_name varchar NOT NULL
    , UNIQUE (cs_id,city_name)
    );

CREATE TABLE adresses3
    ( person_id INTEGER NOT NULL PRIMARY KEY -- could be a serial
    , last_name varchar NOT NULL
    , first_first_name varchar
    , gender CHAR(1)
    , dob DATE
            -- allowing NULL here allows for 'embryonic' records without city/state/country info.
    , city_id INTEGER references cities3(city_id)
    );

Though this NOT NULL constraint will avoid duplicates in {city,state,country}, it will also impose them to being NOT NULL, obviously. This might be impossible or invalid in other countries (than Canada or the US). In the Netherlands, we don't have state or county; we do have provincie, which is hardly used (only to disambiguate, if needed) Similar for the French departements, IIRC.

查看更多
我命由我不由天
5楼-- · 2020-05-19 06:24

Before I get started, I want to point out that {city, state, country} isn't an address.

Should things like City, or State or Country be normalized and broken up into their own table and then this table have CityId and StateId columns. We were having a debate whether this was a good or bad decision.

Normalization is good. I'm almost always advocating for normalization.

But using ID numbers instead of text has nothing to do with normalization. Substituting "CityId" for "City" and "StateId" for "State" has no effect on the normal form of the table. If it were in 3NF before that change, it will still be in 3NF after that change.

You can increase data integrity with a foreign key reference. Data integrity is also good. But this, like many other database design decisions, doesn't have anything to do with normalization.

The simplest way to increase data integrity for cities would be to select distinct cities into a new table. (PostgreSQL syntax.)

select distinct city, state, country
into new_table
from person;

You need city, state, and country to represent the "full name" of a city. You also need a key.

alter table new_table
add primary key (city, state, country);

Now you can declare a foreign key constraint to guarantee that {city, state, country} will always reference one single row in that new table.

alter table Person
add constraint city_state_country_from_new_table
foreign key (city, state, country)
references new_table (city, state, country)
on update cascade;

I wouldn't worry about the performance of cascading updates for this kind of table. (Unless I were using Oracle; Oracle doesn't support cascading updates.) These kinds of names change rarely, and I know PostgreSQL can cascade updates to 3 million rows in a table of 50 million rows in less than 3 seconds on my desktop. My desktop isn't anything special, and it's running 3 database management systems and two web servers. If I had bigger tables and needed more time, I'd schedule the change during a maintenance window.

You can increase data integrity for states in the same way.

select distinct state, country
into another_new_table
from new_table;
etc., etc.

Having said all that, adding a surrogate key to the new_table is a defensible design decision, but only if you spend some time thinking about it. (Not thinking is never defensible.)

The most immediate effect of replacing {city, state, country} with a surrogate key is that you now need a join in every query on a table that required no joins before. You can test the effect on performance with random sample data. Until you have many millions of rows, you'll probably find that the natural key is faster than a join on the surrogate key. That's what I found when I tested.

查看更多
三岁会撩人
6楼-- · 2020-05-19 06:25

Yes, almost certainly. If a country or city changes a name, you change it in one place and all references are automatically updated.

Splitting also allows you to add other attributes to a country or city, i.e. the continent it's in, etc. You can't easily do this without a separate table.

Finally if you want a list of countries (to populate a list box for example) you have a single place to reference. (Otherwise you'd end up doing some SELECT DISTINCT from your person table, which is dubious.)

查看更多
我命由我不由天
7楼-- · 2020-05-19 06:29

Normalizing address into a hierarchy is a questionable proposition. It really depends on what you mean to do with your address data.

The idea of normalizing to avoid update anomalies is a little dubious. How often do cities, states or countries actually change names? Furthermore, if this were to happen, how likely would it be that the change would be wholesale? (i.e. every instance of old name X changes to new name Y). I can tell you what happened in practice in Canada when there was a flurry of municipal amalgamations in the 2000's was that boundaries were redrawn, and that lots of old names stuck around, just with smaller territories than before.

The fact is that things like municipality names can be loosely defined. For example, where I grew up, my address had three officially recognized municipality names according to the postal authority: WILLOWDALE, NORTH YORK, TORONTO - all of which were valid options, although one was "more official" than the others. The problem is that all of Willowdale is in North York, but North York also contains "Downsview" and others.

Other frequent arguments for normalizing addresses include: ensure proper spelling and providing a basis for territory management. Given the vagaries of address data quality, these arguments are not convincing.

The best way to ensure address data quality is to keep your addresses in a relatively flat, relatively simple structure and to employ one or more address quality tools that use postal authority data to match and standardize your addresses. Keep city, state and postal code in their own fields, by all means, but don't keep them in distinct tables. This is actually more flexible than a normalized structure while producing more reliable results overall.

Similarly, territory management is best done at a more granular level than municipality. Some municipalities are enormous and names can be ambiguous. Instead use a postal code or ZIP+4 (depending on jurisdiction). This is much more granular and unambiguous. Again, an address data quality tool will ensure that you have proper postal coding on your addresses.

查看更多
登录 后发表回答