I'm interested in learning about tools that operate like this:
Given a database schema and some data, suggest whether the schema is likely structured in any particular normal form, and then tell how the schema might be factored to produce further normalization.
Basically, a static analysis tool for database schema design.
Like other static analysis tools, such a database tool need not generate perfect results (I doubt such a "perfect" tool is computationally feasible), or applicable to all database systems, or free/open source, or anything else. The tool need not be stand-alone; it could be bundled as some sophisticated IDE feature. I'm just wondering what's out there.
It can be done, and there are at least 2 commecially available tools that can do normalization for you: GeneXus and DeKlarit.
They use a process named NormalizationBySynthesis
A tool like you describe, that tried to analyze your data and metadata and advise you of possible non-normalized structure, would give wrong suggestions so frequently that it would make Clippy, the Microsoft assistant seem like an indispensable writer's aid.
The process of normalization involves mapping software requirements to a logical data model. An analysis tool cannot know your data requirements any better than you do. So it can't infer from an incorrect database design which parts of it are wrong.
I understand you have qualified the question and you have limited expectations for the tool's capability. But you would want it to be useful for ordinary, everyday tasks -- but it would not be reliable even for the most simple cases.
Compare to static code analysis tools. Suppose you write an application and deliver it to your client, and the client says "why can't I send email from this app?" How would a static code analysis tool tell you that you omitted a desired feature? It can't know those requirements.
Likewise, how would a database normalization analysis tool know whether it's appropriate for the UserAccount
table to have a single MobilePhoneNumber
attribute, or if it would be more appropriate to separate phone numbers into another table so a single user could have multiple phones listed?
I have thought about this problem too. It is theoritically possible and there are some research papers on this topic. there used to be pretty cool tool at www.dbtools.cs.cornell.edu. This was developed by the famous author Raghu Ramakrishnan. He is currently at Yahoo Research. You can refer to the following papers for more information
- Diederich,T. and Miton,J.,(1988),"New Methods and Fast Algorithms for Database Normalization ",ACM Transactions on Database Systems, 13(3),339-365.
- Bernstein, P. A.(1986),"Synthesizing Third Normal Form Relations from Functional Dependencies",ACM Transactions on Database Systems , Vol.1. No. 4, pp 277-298.
- JMathNorm: A Database Normalization Tool Using Mathematica, Lecture Notes In Computer Science; Vol. 4488, Proceedings of the 7th international conference on Computational Science, Part II, Ali Yazici, Ziya Karakaya
The third link is very interesting. Here is the abstract of the paper:
This paper is about designing a
complete interactive tool, named
JMathNorm, for relational database
(RDB) normalization using Mathematica.
It is an extension of the prototype
developed by the same authors [1] with
the inclusion of Second Normal Form
(2NF), and Boyce-Codd Normal Form
(BCNF) in addition to the existing
Third normal Form (3NF) module. The
tool developed in this study is
complete and can be used for real-time
database design as well as an aid in
teaching fundamental concepts of DB
normalization to students with limited
mathematical background. JMathNorm
also supports interactive use of
modules for experimenting the
fundamental set operations such as
closure, and full closure together
with modules to obtain the minimal
cover of the functional dependency set
and testing an attribute for a
candidate key. JMathNorm's GUI
interface is written in Java and
utilizes Mathematica's JLink facility
to drive the Mathematica kernel.
I am also very interested in an answer to this question. If anyone has come across a tool, please let us know!
The open-source tool SchemaSpy detects some "Anomalies", like "Tables without indexes" or "Columns that are flagged as both 'nullable' and 'must be unique'". One of these anomalies is "Tables with incrementing column names, potentially indicating denormalization"
http://schemaspy.sourceforge.net/
I think that's hard to implement.
Like, a table
Id | Name | Surname | SSN
is in 1NF, and
Id | Name | Surname | Mobile
is not, but you can say it neither from the design nor from data, only from the field name.
I've seen once a database for ultrasonography that actually had both GENDER
and LMP
(last menstrual period) in one table.
It can't be done.
Normalization is defined in terms of functional dependencies which
- Cannot be expressed in SQL
- Cannot be inferred from the data
You can look at a table such as
A | B
--+--
1 | 1
1 | 2
and deduce that B is not dependent on A (because it has two distinct values for a single value of A), but you can never deduce that a dependency does exist, e.g. A may or may not depend on B.