I was going through this post for record level versioning of tables. I noticed that the architecture deals with the usage of history tables. However, my scenario does not require rollback but retrieving point in time records. This is where I have tried with a design on using a single table for versioning. Note that this is a bare bone table data (no constraints, indices, etc.). I intend to index based on id since this involves group by clause on the column.
For example, I have got a table Test where
id is the identifier,
modstamp is the timestamp of the data (never null)
In addition to the columns above, the table will contain bookkeeping columns
local_modstamp is the timestamp at which the record was updated
del_modstamp is the timestamp at which the record was deleted
During backup, all the records are obtained from the source and inserted where the records would have the values local_modstamp = null and del_stamp = null.
id |modstamp |local_modstamp |del_modstamp |
---|---------------------------|---------------|-------------|
1 |2016-08-01 15:35:32 +00:00 | | |
2 |2016-07-29 13:39:45 +00:00 | | |
3 |2016-07-21 10:15:09 +00:00 | | |
Once the records are obtained, these are the scenarios for handling the data (assuming the reference time [ref_time] is the time at which the process is run):
Insert as normal.
Update: Update the most recent record with local_modstamp = ref_time. Then insert the new record. The query would be: update test set local_modstamp = where id = and local_modstamp is not null and del_modstamp is not null insert into test values(...)
Delete: Update the most recent record with del_modstamp = ref_time. update test set del_modstamp = where id = and local_modstamp is not null and del_modstamp is not null
The design aims at getting the latest records where local_modstamp is not null and del_modstamp is not null. However, I ran into an issue where I intend to retrieve point in time using the query (inner-most query):
select id, max(modstamp) from test where modstamp <= <ref_time> and (del_modstamp is null || del_modstamp <= <ref_time>) group by id;
It seems that I have made a mistake (have I?) of using null as a placeholder to identify the latest records of the table. Is there a way to use the existing design to obtain the point in time records?
If not, I guess the probable solution is to set the local_modstamp to the latest records. This would require to update the logic using max(local_modstamp) in case of updates. Can I persist on my existing architecture to achieve in retrieving the point in time data?
I am using SQL-Server right now but this design may be extended to other database products too. I intend to use a more general approach to retrieve the data instead of using vendor specific hacks.
Introducing Version Normal Form. Consider this table:
Static data is data that does not change during the lifetime of the entity or that doesn't require tracking. Volatile data changes and those changes must be tracked.
Move volatile attributes to a separate table:
The Entities table no longer contains the volatile attributes.
An insert operation creates the master entity record with static data, generating the unique ID value. That value is used to insert the first version with the initial values of the volatile data. An update generally does nothing to the master table (unless a static value is actually changed) and a new version of the new volatile data is written to the version table. Note no changes are made to existing versions, particularly the latest or "current" version. The new version is inserted, end of operation.
To "undo" the latest version, or any version actually, simply delete that version from the version table.
For example, an Employees table with the following attributes:
EmployeeNum will, of course, be static along with HireDate and FirstName. PhoneExt may change from time to time but we don't care. So it is designated static. The final design is:
On 1 Jan, 2016 we hired Sally Smith. The static data is inserted into Employees_S generating an EmployeeNum value of 1001. We use that value to also insert the first version.
On 1 Mar, she gets a pay raise:
On 1 May, she gets married:
Note that versions of the same entity, other than the restriction that the Effective dates cannot be the same, are completely independent of each other.
To see what the current state of employee 1001 looks like, here is the query:
Here's the cool part. To see what the state of employee 1001 looked like on, say 11 Feb, here is the query:
It's the same query -- except for the last line of the subquery. Current and historical data reside in the same table and are queried with the same statement.
Here's another cool feature. It's 1 Jul and we know that on 1 Sep, Sally is going to transfer to the marketing dept, with another pay raise. The paperwork has already gone through. Go ahead and insert the new data:
The next-to-last version will still show up as the current version but the first query executed on or after 1 Sep will show the Marketing data.
Here are the slides of a presentation I have made a few times at tech fairs. It contains more details about how all the above can be done including the queries. And here is a document that goes into a lot more detail.