Algorithms for Updating Relational Data

2020-06-06 06:33发布

问题:

What algorithms are known to perform the task of updating a database by inserting, updating, and deleting rows in the presence of database constraints?

More specifically, say that before images of rows to be deleted, after images of rows to be inserted, and both images of rows to be updated are in memory. The rows might be for several tables. An exact sequence of updates is either not known or has not been preserved -- only the before images and the after images that the database must ultimately be made to reflect are known.

The database contains primary key, foreign key, and unique index constraints. The problem is to find the sequence of commands to bring the database up to date. For simplicity, I am willing to specify that the primary key of a row will never be modified.

The database system does not support deferred constraint checking. (The solution is trivial for such databases). I must also make the rule that primary key columns may not be updated after insert and it is not allowed to delete a row and re-insert one with the same primary key, even though some algorithms might otherwise find it convenient to do so. (This is necessary for common scenarios where all primary keys are autogenerated by the database system.)

What are the algorithms:

  1. Assuming that foreign key constraints must be enforced at all times but unique indexes are not used.
  2. Assuming that foreign key constraints and unique index constraints must be enforced at all times.

I ask for both because I think #1 might be considerably simpler.

EDIT: The goal here is to solve the problem of lack of deferred constraint checking in a general (or nearly general purpose) way. I suppose high-quality ORM packages must do this. I want an explanation of the algorithms, either provided by you here or externally in an academic paper, etc. I will not consider a pointer to a software package or source code an answer to this question.

NAIVE ALGORITHM:

Loop through the tables and for each row that is added, changed, or deleted generate a single INSERT, UPDATE, or DELETE statement, respectively for each. Loop through the generated statements and apply to the database. If a statement fails to apply, continue with the other statements. Retry the statements that have failed. Keep iterating until there are no more failures or a pass successfully executes no statements. If statements remain, attempt temporary tweaking of the data in the problematic columns to try to get them to succeed.

This is an ugly brute force algorithm and figuring out the "temporary tweaking" part is a challenge of its own. So, I would like an improved and complete algorithm.

EDIT2:

RBarryYoung posts an answer which gets close (but no cigar) to fully solving scenario #1 while solving the most common scenario #2 problems simultaneously. The following is an example of a scenario #1 update pattern I have seen very often in applications but have not yet arrived at the solution. DELETE/UPDATE-INSERT is exactly right a lot of the time in scenario #1, but the trick is to figure out when to deviate from it. I also suspect that deviating from it magnifies the occurence of UNIQUE problems per scenario #2, possibly increasing my interest in solving scenario #2 as well.

Notice that there are no cycles nor are any primary keys modified. However, a foreign key to a parent is modified.

CREATE TABLE A
(
    AId INT NOT NULL PRIMARY KEY
)

CREATE TABLE B
(
    BId INT NOT NULL PRIMARY KEY,
    AId INT NOT NULL FOREIGN KEY REFERENCES A (AId)
)

CREATE TABLE C
(
    CId INT NOT NULL PRIMARY KEY,
    AId INT NOT NULL FOREIGN KEY REFERENCES A (AId),
    BId INT NOT NULL FOREIGN KEY REFERENCES B (BId)
)

Before Images:

A (1)
B (1,1)
C (1,1,1)

After Images:

A (1)
B (2,1) [To be deleted: (1,1)]
C (1,1,2)

Sort Order: A,B,C

The first command is DELETE B (1,1) which fails due to C (1,1,1).

Note that if the third column in C allows NULL (which it does not in this case), a pure solution might involve NULLing it out in an early pass as this would allow the given algorithm to proceed normally and have its usual advantages for dealing with most of the scenario #2 problems. An excellent solution to this problem needs to take things like this into account as well. The full generality of this question is undoubtedly a fascinating problem.

回答1:

Why are you even trying to do this? The correct way to do it is to get the database engine to defer the checking of the constraints until the transaction is committed.

The problem that you pose is intractable in the general case. If you consider just a transitive closure of the foreign keys in the rows you want to update in database then it is only possible to solve this where the graph describes a tree. If there is a cycle in the graph and you can break the cycle by replacing a foreign key value with a NULL then you can re-write one SQL and add another to later update the column. If you can't replace a key value with a NULL then it can't be solved.

As I say, the correct way to do this is to turn off the constraints until all of the SQL has been run and then turn them back on for the commit. The commit will fail if the constraints aren't met. Postgres (for example) has a feature which makes this very easy.



回答2:

I wrote one once, but it's someone else's IP, so I can't go into too much detail. However, I'm willing to tell you the process that taught me how to do this. This was a tool to make a shadow copy of a customer's "database" residing on salesforce.com, written in .NET 1.1.

I started out doing it the brute-force way (create DataSet and database from schema, turn off constraints in the DataSet, iterate through each table, loading rows, ignoring errors, for rows that are not yet in the table, repeat until more rows to add, or no more errors, or no change in the number of errors, then dump the DataSet to the DataBase, until no errors, etc.).

Brute force was the starting point because it wasn't certain that we could do this at all. The "schema" of salesforce.com wasn't a true relational schema. For instance, if I remember correctly, there were some columns which were foreign keys relating to one of several parent tables.

This took forever, even while debugging. I began to notice that most of the time was being spent on handling the constraint violations in the database. I began to notice the pattern of constraint violations, as each iteration converged, slowly, toward getting all the rows saved.

All the revelations I had were due to my boredom, watching the system sit at near 100% CPU for 15-20 minutes at a time, even with a small database. "Necessity is the mother of invention", and "the prospect of waiting another 20 minutes for the same rows, tends to focus the mind", and I figured out how to speed things up by a factor of over 100.



回答3:

OK, I think that this is it, though the Unique Key thing is pretty hard to figure out. Note that any errors encountered in the SQL execution should result in complete rollback of the entire transaction.

UPDATE: The original order that I implemented was:

Each Table, BottumUp(All Deletes for table) Each Table, TopDown(All Updates, then All Inserts)

After a counter-example was posted, I believe that I know haw to correct for the restriced problem only (problem #1, without UCs): by changing the order to:

Each Table, TopDown(All Inserts) Each Table, TopDown(All Updates) Each Table, BottumUp(All Deletes)

This will definitely NOT work with Unique Constraints though, which as far as I can figure will need a row-content based dependency sort (as opposed to the static table FK dependency sort I am currently using). What makes this particularily difficult is that it may require getting info about record-content other than the changed ones (in particular checking for the existence of UC conflict-values and child-dependent records for intermediate steps).

Anyway, here's the current version:

Public Class TranformChangesToSQL
 Class ColVal
    Public name As String
    Public value As String  'note: assuming string values'
 End Class

 Class Row
    Public Columns As List(Of ColVal)
 End Class

 Class FKDef
    'NOTE: all FK''s are assumed to be of the same type: records in the FK table'
    ' must have a record in the PK table matching on FK=PK columns.'
    Public PKTableName As String
    Public FKTableName As String
    Public FK As String
 End Class

 Class TableInfo
    Public Name As String
    Public PK As String                     'name of the PK column'
    Public UniqueKeys As List(Of String)    'column name of each Unique key'
    'This table''s Foreign Keys (FK):'
    Public DependsOn As List(Of FKDef)
    'Other tables FKs that point to this table'
    Public DependedBy As List(Of FKDef)
    Public Columns As List(Of String)
    'note: all row collections are indexed by PK'
    Public inserted As List(Of Row)     'inserted after-images'
    Public deleted As List(Of Row)      'deleted before-images'
    Public updBefore As List(Of row)
    Public updAfter As List(Of row)
 End Class

 Sub MakeSQL(ByVal tables As List(Of TableInfo))
    'Note table dependencies(FKs) must NOT form a cycle'

    'Sort the tables by dependency so that'
    ' child tables (FKs) are always after their parents (PK tables)'
    TopologicalSort(tables)

    For Each tbl As TableInfo In tables
        'Do INSERTs, they *must* be done first in parent-> child order, because:'
        '   they may have FKs dependent on parent inserts'
        '   and there may be Updates that will make child records dependent on them'
        For Each r As Row In tbl.inserted
            Dim InsSQL As String = "INSERT INTO " & tbl.Name & "("
            Dim valstr As String = ") VALUES("
            Dim comma As String = ""
            For Each col As ColVal In r.Columns
                InsSQL = InsSQL & comma & col.name
                valstr = valstr & comma & "'" & col.value & "'"
                comma = ", "    'needed for second and later columns'
            Next
            AddSQL(InsSQL & valstr & ");")
        Next
    Next

    For Each tbl As TableInfo In tables
        'Do UPDATEs'
        For Each aft In tbl.updAfter
            'get the matching before-update row'
            Dim bef As Row = tbl.updBefore(aft.Columns(tbl.PK.ColName).value)
            Dim UpdSql As String = "UPDATE " & tbl.Name & " SET "
            Dim comma As String = ""
            For Each col As ColVal In aft.Columns
                If bef.Columns(col.name).value <> col.value Then
                    UpdSql = UpdSql & comma & col.name & " = '" & col.value & "'"
                    comma = ", "  'needed for second and later columns'
                End If
            Next
            'only add it if any columns were different:'
            If comma <> "" Then AddSQL(UpdSql & ";")
        Next
    Next

    'Now reverse it so that INSERTs & UPDATEs are done in parent->child order'
    tables.Reverse()

    For Each tbl As TableInfo In tables.Reverse
        'Do DELETEs, they *must* be done last, and in child->paernt order because:'
        '   Parents may have children that depend on them, so children must be deleted first,'
        '   and there may be children dependent until after Updates pointed them away'
        For Each r As Row In tbl.deleted
            AddSQL("DELETE From " & tbl.Name & " WHERE " & tbl.PK.ColName & " = '" & r.Columns(tbl.PK.ColName).value) & "';"
        Next
    Next

 End Sub
End Class


回答4:

No, I don't find it fascinating. I don't find the quadrature-of-the-circle-problem fascinating either, and on that topic too there do exist people who strongly, or even violently, disagree with me.

When you say that "it has practical applications", do you mean to say that "the solution to this problem has practical applications" ? I suggest that a solution that does not exist, by definition cannot have "practical applications". (And I do suggest that the solution you're seeking does not exist, just like the quadrature of the circle.)

You argued something about "when other apps hang cascading deletes ...". Your initial problem statement contained no mention whatsoever of "other apps".

The problem I find way more fascinating is "how to build a DBMS that is good enough so that programmers will no longer be facing these kinds of problem, and no longer be forced to ask these kinds of question". Such a DBMS supports Multiple Assignment.

I wish you the best of luck.



回答5:

It sounds like you are looking for a Database Diff tool. Such a tool would look for differences between two tables (or two databases), and generate the necessary scripts to align them.

See the following post for more information:
https://stackoverflow.com/questions/104203/anyone-know-of-any-good-database-diff-tools



回答6:

OpenDbDiff has source code available. You could look at that and figure out the algorithms.

http://opendbiff.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=25206