Database versioning without history tables

2019-01-20 16:04发布

I was going through this post for record level versioning of tables. I noticed that the architecture deals with the usage of history tables. However, my scenario does not require rollback but retrieving point in time records. This is where I have tried with a design on using a single table for versioning. Note that this is a bare bone table data (no constraints, indices, etc.). I intend to index based on id since this involves group by clause on the column.

For example, I have got a table Test where

id is the identifier,

modstamp is the timestamp of the data (never null)

In addition to the columns above, the table will contain bookkeeping columns

local_modstamp is the timestamp at which the record was updated

del_modstamp is the timestamp at which the record was deleted

During backup, all the records are obtained from the source and inserted where the records would have the values local_modstamp = null and del_stamp = null.

id |modstamp                   |local_modstamp |del_modstamp |
---|---------------------------|---------------|-------------|
1  |2016-08-01 15:35:32 +00:00 |               |             |
2  |2016-07-29 13:39:45 +00:00 |               |             |
3  |2016-07-21 10:15:09 +00:00 |               |             |

Once the records are obtained, these are the scenarios for handling the data (assuming the reference time [ref_time] is the time at which the process is run):

  1. Insert as normal.

  2. Update: Update the most recent record with local_modstamp = ref_time. Then insert the new record. The query would be: update test set local_modstamp = where id = and local_modstamp is not null and del_modstamp is not null insert into test values(...)

  3. Delete: Update the most recent record with del_modstamp = ref_time. update test set del_modstamp = where id = and local_modstamp is not null and del_modstamp is not null

The design aims at getting the latest records where local_modstamp is not null and del_modstamp is not null. However, I ran into an issue where I intend to retrieve point in time using the query (inner-most query):

select id, max(modstamp) from test where modstamp <= <ref_time> and (del_modstamp is null || del_modstamp <= <ref_time>) group by id;

It seems that I have made a mistake (have I?) of using null as a placeholder to identify the latest records of the table. Is there a way to use the existing design to obtain the point in time records?

If not, I guess the probable solution is to set the local_modstamp to the latest records. This would require to update the logic using max(local_modstamp) in case of updates. Can I persist on my existing architecture to achieve in retrieving the point in time data?

I am using SQL-Server right now but this design may be extended to other database products too. I intend to use a more general approach to retrieve the data instead of using vendor specific hacks.

1条回答
祖国的老花朵
2楼-- · 2019-01-20 16:26

Introducing Version Normal Form. Consider this table:

create table Entities(
    ID     int identity primary key,
    S1     [type],  -- Static data
    Sn     [type],  -- more static data
    V1     [type],  -- Volatile data
    Vn     [type]   -- more volatile data
);

Static data is data that does not change during the lifetime of the entity or that doesn't require tracking. Volatile data changes and those changes must be tracked.

Move volatile attributes to a separate table:

create table EntityVersions(
    ID        int  not null,
    Effective date not null default sysdate(),
    Deleted   bit  not null default 0,
    V1        [type],
    Vn        [type],
    constraint PK_EntityVersions primary key( ID, Effective ),
    constraint FK_EntityVersionEntity foreign key( ID )
        references Entities( ID )
);

The Entities table no longer contains the volatile attributes.

An insert operation creates the master entity record with static data, generating the unique ID value. That value is used to insert the first version with the initial values of the volatile data. An update generally does nothing to the master table (unless a static value is actually changed) and a new version of the new volatile data is written to the version table. Note no changes are made to existing versions, particularly the latest or "current" version. The new version is inserted, end of operation.

To "undo" the latest version, or any version actually, simply delete that version from the version table.

For example, an Employees table with the following attributes:

EmployeeNum, HireDate, FirstName, LastName, PayRate, Dept, PhoneExt

EmployeeNum will, of course, be static along with HireDate and FirstName. PhoneExt may change from time to time but we don't care. So it is designated static. The final design is:

Employees_S
===========
  EmployeeNum (PK), HireDate, FirstName, PhoneExt

Employees_V
===========
  EmployeeNum (PK), Effective (PK), IsDeleted, LastName, PayRate, Dept

On 1 Jan, 2016 we hired Sally Smith. The static data is inserted into Employees_S generating an EmployeeNum value of 1001. We use that value to also insert the first version.

Employees_S
===========
  1001, 2016-01-01, Sally, 12345

Employees_V
===========
  1001, 2016-01-01, 0, Smith, 35.00, Eng

On 1 Mar, she gets a pay raise:

Employees_S
===========
  1001, 2016-01-01, Sally, 12345

Employees_V
===========
  1001, 2016-01-01, 0, Smith, 35.00, Eng
  1001, 2016-03-01, 0, Smith, 40.00, Eng

On 1 May, she gets married:

Employees_S
===========
  1001, 2016-01-01, Sally, 12345

Employees_V
===========
  1001, 2016-01-01, 0, Smith, 35.00, Eng
  1001, 2016-03-01, 0, Smith, 40.00, Eng
  1001, 2016-05-01, 0, Jones, 40.00, Eng

Note that versions of the same entity, other than the restriction that the Effective dates cannot be the same, are completely independent of each other.

To see what the current state of employee 1001 looks like, here is the query:

select  s.EmployeeNum, s.HireDate, s.FirstName, v.LastName, v.PayRate, v.Dept, s.PhoneExt
from    Employees_S s
join    Employees_V v
    on  v.EmployeeNum = s.EmployeeNum
    and v.Effective = ( select  Max( Effective )
                        from    Employees_V
                        where   EmployeeNum = v.EmployeeNum
                            and Effective <= SysDate() )
where   s.EmployeeNum = 1001
    and v.IsDeleted = 0;

Here's the cool part. To see what the state of employee 1001 looked like on, say 11 Feb, here is the query:

select  s.EmployeeNum, s.HireDate, s.FirstName, v.LastName, v.PayRate, v.Dept, s.PhoneExt
from    Employees_S s
join    Employees_V v
    on  v.EmployeeNum = s.EmployeeNum
    and v.Effective = ( select  Max( Effective )
                        from    Employees_V
                        where   EmployeeNum = v.EmployeeNum
                            and Effective <= '2016-02-11' )
where   s.EmployeeNum = 1001
    and v.IsDeleted = 0;

It's the same query -- except for the last line of the subquery. Current and historical data reside in the same table and are queried with the same statement.

Here's another cool feature. It's 1 Jul and we know that on 1 Sep, Sally is going to transfer to the marketing dept, with another pay raise. The paperwork has already gone through. Go ahead and insert the new data:

Employees_S
===========
  1001, 2016-01-01, Sally, 12345

Employees_V
===========
  1001, 2016-01-01, 0, Smith, 35.00, Eng
  1001, 2016-03-01, 0, Smith, 40.00, Eng
  1001, 2016-05-01, 0, Jones, 40.00, Eng
  1001, 2016-09-01, 0, Jones, 50.00, Mkt

The next-to-last version will still show up as the current version but the first query executed on or after 1 Sep will show the Marketing data.

Here are the slides of a presentation I have made a few times at tech fairs. It contains more details about how all the above can be done including the queries. And here is a document that goes into a lot more detail.

查看更多
登录 后发表回答