A good way to quickly survey the information in a database is to apply a tool that automatically creates a database diagram of all tables and all relationships between them.
In my experience, such tools use foreign keys as the relationships, which most of the databases I try them do not contain. Sure, they satisfy constraints corresponding to foreign keys, but do not enforce them. And I'll end up with a 'diagram' consisting of a bunch of unrelated tables.
So what I'm looking for is software that can compute "undeclared foreign keys" and either
- uses them as table relations in a database diagram, or
- generates SQL code for corresponding foreign key declarations
Do you know any tools, free if possible, that can already do this?
The following products are all claiming to provide foreign-keys discovery abilities:
ERwin http://www.ascent.co.za/products/ca_erwin_data_profiler.html
Informatica https://community.informatica.com/onlinehelp/analyst/961/en/index.htm#page/data-discovery-guide/GUID-33EAF039-ECFC-49FD-96F4-A2C2A4EB857F.1.148.html
and XCaseForI http://xcasefori.com/discovering/index.html
Statistical methodologies able to provide a kind of similarity rank like range distribution and creation time as suggested by Kirk, seems to be the right way. .. I'd need to implement it using SAS EG or any free tool.
Interesting question. You're looking to parse a database schema and data to determine which tables are relevant or should be related to each other, without any strict definition of the relationship. In effect, you're trying to infer a relationship.
I see two ways that you can infer such a relationship. First let me say that your approach might vary depending on the databases you're working with. A number of questions spring to mind (I don't want answers, but they are worth reflecting on)
Note that this type of inference will almost certainly give false results, and is built on a lot of assumptions.
So I offer two approachs that I'd use in concert.
Inferring a relationship through structure / naming (symbolic analysis)
Common database design is to name a PK column after the table name (e.g.
CustomerId
on tableCustomer
), or alternatively name the PK column simplyId
.A table with a FK relationship to another often names its related column the same as the related table. In the
Order
table I'd expect aCustomerId
column which refers to theCustomerId
/Id
column in theCustomer
table.This type of analysis would include
FirstCustomerId
&SecondCustomerId
both refer to theCustomerId
column in theCustomer
table)Inferring a relationship through data (statistical analysis)
Looking at data, as you suggest you have done in your comments, will allow you to determine 'possible' references. If the
CustomerId
column in theOrder
table contains values which don't exist in theId
column of theCustomer
table then it's reasonable to question that this is a valid relationship (although you never know!)A simple form of data analysis is using dates and times. Rows that were created with close proximity to one another are more likely to be related to one another. If, for every
Order
row that was created, there also exist between 2 and 5Item
rows created within a few seconds, then a relationship between the two is likely.A more detailed analysis might look at the range and distribution of used values.
For example, if your
Order
table has aSt_Id
column - you might infer using symbolic analysis that the column is likely to relate to either aState
table or aStatus
table. TheSt_Id
column has 6 discrete values, and 90% of the records are covered by 2 values. TheState
table has 200 rows, and theStatus
table has 9 rows. You could quite reasonably infer that theSt_Id
column relates to theStatus
table - it gives a more greater coverage of the rows of the table (2/3 of the rows are 'used', whereas only 3% of the rows in theState
table would be used).If you perform data analysis on existing databases to gather 'real life data', I'd expect some patterns that could be used as guides to structure inference. When a table with a large number of records has a column with a small number of values repeated many times (not necessarily in order), it's more likely to this column relates to a table with a correspondingly small number of rows.
In summary
Best of luck. It's an interested problem, I've just thrown some ideas out there but this is very much a trial & error, data gathering and performance tuning situation.
This is a non-trivial exercise in most cases. If you are lucky enough to be analysing a schema for a modern framework, such as Ruby on Rails, or CakePHP or similar, and the developers have been stringent about following column conventions, then you have a reasonable chance of finding many, but not all, of the implied relationships.
I.e. if your tables use columns like
user_id
to refer to entries in theusers
tables.Be aware: some entity names may pluralise irregularly (
entity
being a good example:entities
, notentitys
) and these are harder to catch (but still possible). However, keys such asadmin_id
which the developers join with the users table onuser.id
can't be inferred. You would need to handle those cases manually.You didn't specify an RDBMS, but I used MySQL a lot, and I'm currently working on this problem for myself.
The following MySQL script will infer most relationships implied by column names. It then lists any relationships that it could not find table names for, so at least you know which ones you're missing. The inferred parent and child are listed, along with singular and plural names, plus the implied relationship:
This will return results like this:
Then you can examine any naming convention exceptions detected with:
This will return results like this, which you can process manually:
You can modify these scripts to spit out whatever is useful to you, include foreign key create statements, if you want to. Here, the final column is a simple 'has many' relationship statement. I use this in a tool I've built called called pidgin, which is rapid modelling tool that draws relationship diagrams on the fly based on relationship statements written a very simple syntax (called 'pidgin'). You can check it out at http://pidgin.gruffdavies.com
I've run the above script on a demo DB to show you the sort of results you can expect:
I haven't catered for the irregular plurals in my script, but I might have a go at that too, at least for the case of entities ending in -y. If you want to have a try at that yourself, I'd recommend writing a stored function that takes
<name>_id
column names as a parameter, strips the_id
part and then applies some heuristics to attempt to pluralise correctly.Hope that's useful!
I don't know about the softwares which may help in searching what you require, but The following query will help to get you started. It lists all Foreign Key Relationships within the current database.
Hope this helps.