Inbreeding-immune database structure

2020-07-25 05:46发布

问题:

I have an application that requires a "simple" family tree. I would like to be able to perform queries that will give me data for an entire family given one id from a member in the family. I say simple because it does not need to take into account adoption or any other obscurities. The requirements for the application are as follows:

  • Any two people will not be able to breed if they're from the same genetic line
  • Needs to allow for the addition of new family lines (new people with no previous family)
  • Need to be able to pull siblings, parents separately through queries

I'm having trouble coming up with the proper structure for the database. So far I've come up with two solutions but they're not very reliable and will probably get out of hand quite quickly.

Solution 1 involves placing a family_ids field on the people table and storing a list of unique family ids. Each time two people breed the lists are checked against each other to make sure no ids match and if everything checks out will merge the two lists and set that as the child's family_ids field.

Example:

Father (family_ids: (null)) breeds with Mother (family_ids: (213, 519)) ->
Child (family_ids: (213, 519)) breeds with Random Person (family_ids: (813, 712, 122, 767)) ->
Grandchild (family_ids: (213, 519, 813, 712, 122, 767))

And so on and so forth... The problem I see with this is the lists becoming unreasonably large as time goes on.

Solution 2 uses cakephp's associations to declare:

public $belongsTo = array(
    'Father' => array(
        'className' => 'User',
        'foreignKey' => 'father_id'
    ),
    'Mother' => array(
        'className' => 'User',
        'foreignKey' => 'mother_id'
    )
);

Now setting recursive to 2 will fetch the results of the mother and father, along with their mother and father, and so on and so forth all the way down the line. The problem with this route is that the data is in nested arrays and I'm unsure of how to efficiently work through the code.

If anyone would be able to steer me in the direction of the most efficient way to handle what I want to achieve that would be tremendously helpful. Any and all help is greatly appreciated and I'll gladly answer any questions anyone has. Thanks a lot.

回答1:

In the SQL (more correctly, RDBS) I'd use the following solution:

1) create a table people with the following fields - id, name, father_id, mother_id. The first one is a typical primary key column, father_id and mother_id refer to this column but are NULLable (to allow addition of new family lines).

2) create a table relatives with the following fields - person_id, ancestor_id. Both are not NULL, both form a composite primary key, both also are FK for person.id.

And that's it. No, really! ) Now consider your tasks:

  • add some people without family lines

That's also pretty doable: INSERT INTO people (name) VALUES ('some_name'). The trick is to make another insert related to this fresh person into relatives: INSERT INTO relatives VALUES (%new_person_id%, %new_person_id%)

What's that for? Consider the most common task: add some person which actually has both father and mother listed in your tables already. With this structure it's done as simple as (after inserting the corresponding record into people, and getting this person_id as a result)...

INSERT INTO relatives 
    SELECT %new_person_id%, ancestor_id 
      FROM relatives 
     WHERE person_id IN (%father_id%, %mother_id%);
INSERT INTO relatives VALUES (%new_person_id%, %new_person_id%);
  • any two people will not be able to breed if they're from the same genetic line.

With the structure described above it's rather simple: you have to look for two records in relatives that has the same value in ancestor_id field. For example:

    SELECT COUNT(*) 
      FROM relatives ra 
INNER JOIN relatives rb ON ra.ancestor_id = rb.ancestor_id
     WHERE ra.person_id = %person_a_id%
       AND rb.person_id = %person_b_id%

It's quite easy to look for all ancestors and children in this structure; but I'd still prefer de-normalized approach (i.e., storing father_id and mother_id in the first table) to speed up the look-up for direct parents/children - it actually can be done with the first table alone.

Here's a working (albeit a bit short) SQL Fiddle example to show this in more practical color. )