I have built a few projects (CMS and EC system) that required some data to be versioned.
Usually I come up with this kind of schema:
+--------------+
+ foobar +
+--------------+
+ foobar_id +
+ version +
+--------------+
It has worked great, but I am wondering if there is a better way to do it. The main problem with this solution is that you always have to use a subquery to get the latest version.
For example:
SELECT * FROM foobar WHERE foobar_id = 2 AND version = (SELECT MAX(version) FROM foobar f2 WHERE f2.foobar_id = 2)
This makes most queries more complicated and also has some performance drawbacks.
So it would be nice if you could share your experience creating versioned tables, and the pros and cons of each method.
Thanks
The cleanest solution in my opinion would be to have a History table for each table that requires versioning. In other words, have a foobar table, and then a foobar_History table, with a trigger on foobar that writes the existing data to the History table along with a timestamp and the user that changed the data. Older data is easily queryable, sorted by timestamp descending, and you know that the data in the main table is always the latest version.
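A minimal sketch of the trigger approach, using SQLite through Python's sqlite3 module so that it runs as-is. The title, updated_by, replaced_by, archived_at and history_id columns and the trigger name are assumptions made up for illustration; trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foobar (
    foobar_id  INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,   -- hypothetical payload column
    updated_by TEXT NOT NULL    -- hypothetical: user who wrote this row
);

CREATE TABLE foobar_History (
    history_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    foobar_id   INTEGER NOT NULL REFERENCES foobar (foobar_id),
    title       TEXT NOT NULL,
    replaced_by TEXT NOT NULL,  -- user whose update replaced this data
    archived_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Before each update, copy the row that is about to be overwritten.
CREATE TRIGGER foobar_archive
BEFORE UPDATE ON foobar
FOR EACH ROW
BEGIN
    INSERT INTO foobar_History (foobar_id, title, replaced_by)
    VALUES (OLD.foobar_id, OLD.title, NEW.updated_by);
END;
""")

conn.execute("INSERT INTO foobar VALUES (2, 'first draft', 'alice')")
conn.execute("UPDATE foobar SET title = 'second draft', updated_by = 'bob' "
             "WHERE foobar_id = 2")
conn.execute("UPDATE foobar SET title = 'final', updated_by = 'carol' "
             "WHERE foobar_id = 2")

# The main table only ever holds the latest version: no subquery needed.
current = conn.execute(
    "SELECT title FROM foobar WHERE foobar_id = 2").fetchone()

# Older versions, newest first. history_id breaks ties because
# CURRENT_TIMESTAMP only has one-second resolution in SQLite.
history = conn.execute(
    "SELECT title, replaced_by FROM foobar_History WHERE foobar_id = 2 "
    "ORDER BY archived_at DESC, history_id DESC").fetchall()
```

Here the application never writes to foobar_History directly; it only updates foobar and the trigger does the copying.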
You can simplify the query by using a view over your table which filters to the latest version. This only makes the queries look nicer; you still have the performance overhead.
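As a rough illustration, such a view could look like the following (SQLite via Python's sqlite3; the view name foobar_latest and the title column are assumptions, not part of the question's schema). The subquery is still there, it is just hidden behind the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foobar (
    foobar_id INTEGER NOT NULL,
    version   INTEGER NOT NULL,
    title     TEXT NOT NULL,     -- hypothetical payload column
    PRIMARY KEY (foobar_id, version)
);

-- The view keeps, for every foobar_id, only the row with the highest version.
CREATE VIEW foobar_latest AS
SELECT f.*
FROM foobar AS f
WHERE f.version = (SELECT MAX(f2.version)
                   FROM foobar AS f2
                   WHERE f2.foobar_id = f.foobar_id);
""")

conn.executemany(
    "INSERT INTO foobar VALUES (?, ?, ?)",
    [(1, 1, 'a'), (2, 1, 'b v1'), (2, 2, 'b v2'), (2, 3, 'b v3')])

# Callers query the view as if it were a plain table.
latest = conn.execute(
    "SELECT foobar_id, version, title FROM foobar_latest "
    "WHERE foobar_id = 2").fetchall()
```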
A common technique is to add a version_status column for current/expired. Also note that if you keep new and old records in the same table, you should have a business (natural) key for your entity, something like name + pin, because the primary key will change (increment) with each row.
When deciding whether to keep the record history in the same table or move it to a separate one, consider other tables which have foobar_id as a foreign key. When issuing a new version, should existing foreign keys point to the new PK or to the old PK? If you want to keep a history of relationships, you would probably want to keep everything in the same table. If only the new version is important, you may consider moving expired rows to another table, though it is not necessary.
If you are using Oracle, you can use analytic functions:
select * from ( SELECT a.* , row_number() over (partition by foobar_id order by version desc) rn FROM foobar a WHERE foobar_id = 2 ) where rn = 1
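The same row_number() technique is not limited to Oracle. As a sketch, here it is on SQLite (which supports window functions from version 3.25 on) through Python's sqlite3 module; the title column and the sample rows are made up for illustration. Dropping the foobar_id filter returns the latest version of every entity in one pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE foobar (
    foobar_id INTEGER NOT NULL,
    version   INTEGER NOT NULL,
    title     TEXT NOT NULL,     -- hypothetical payload column
    PRIMARY KEY (foobar_id, version)
)""")
conn.executemany(
    "INSERT INTO foobar VALUES (?, ?, ?)",
    [(1, 1, 'a'), (2, 1, 'b v1'), (2, 2, 'b v2'), (2, 3, 'b v3')])

# Number the rows of each foobar_id from newest to oldest version,
# then keep only row number 1 of each partition.
rows = conn.execute("""
SELECT foobar_id, version, title
FROM (
    SELECT a.*,
           ROW_NUMBER() OVER (PARTITION BY foobar_id
                              ORDER BY version DESC) AS rn
    FROM foobar AS a
)
WHERE rn = 1
ORDER BY foobar_id
""").fetchall()
```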
I used to work on a system with historical data, and we had a boolean to indicate which row was the latest version of the data. Of course, you need to maintain the consistency of the flag at the application level. Then you can create indexes that use the flag, and if you provide it in the WHERE clause, it's fast.
Pro:
Cons:
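A minimal sketch of the flag approach, again on SQLite via Python's sqlite3. The is_latest and title columns, the index name and the add_version helper are all hypothetical. The unique partial index is an addition beyond what is described above: it both speeds up lookups on the flag and makes the database reject a second "latest" row for the same foobar_id, so a bug in the application code cannot silently break the flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE foobar (
    foobar_id INTEGER NOT NULL,
    version   INTEGER NOT NULL,
    title     TEXT NOT NULL,                       -- hypothetical payload
    is_latest INTEGER NOT NULL DEFAULT 1
              CHECK (is_latest IN (0, 1)),         -- hypothetical flag
    PRIMARY KEY (foobar_id, version)
);

-- At most one flagged row per foobar_id; also serves flag lookups.
CREATE UNIQUE INDEX foobar_one_latest
    ON foobar (foobar_id) WHERE is_latest = 1;
""")


def add_version(conn, foobar_id, title):
    """Insert a new version and move the flag to it in one transaction."""
    with conn:  # commits on success, rolls back on error
        (max_version,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM foobar "
            "WHERE foobar_id = ?", (foobar_id,)).fetchone()
        conn.execute(
            "UPDATE foobar SET is_latest = 0 "
            "WHERE foobar_id = ? AND is_latest = 1", (foobar_id,))
        conn.execute(
            "INSERT INTO foobar (foobar_id, version, title) "
            "VALUES (?, ?, ?)", (foobar_id, max_version + 1, title))


for title in ('b v1', 'b v2', 'b v3'):
    add_version(conn, 2, title)

# No subquery: the flag identifies the latest row directly.
latest = conn.execute(
    "SELECT version, title FROM foobar "
    "WHERE foobar_id = 2 AND is_latest = 1").fetchall()
```

Reverting to an old version then amounts to moving the flag (or inserting a copy of the old row as a new version) inside the same kind of transaction.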
Otherwise, you can rely on a separate history table, as suggested in several answers.
Pro:
Cons:
What is best will depend on your use case. I had to deal with a document management system where we wanted to be able to version documents. But we also had features like reverting to an old version. It was easier to use a boolean to speed up just the operations that required the latest one. If you have real historical data (which never changes), a dedicated history table is probably better.
Does the concept of history fit in your domain model? If not, then you have a db schema that differs from your conceptual domain model. If, at the domain level, the current data and the old data need to be handled the same way, having two tables complicates the design. Just consider the case where you need to return the complete history (old + new). The easiest solution would be to have one class for each table, but then you can't return a list as easily as if you had only one table. But if these are two distinct concepts, then it's fine to have history be first-class in your design.
I would also recommend this article by M. Fowler, which is interesting when it comes to dealing with temporal data: Patterns for things that change with time
I prefer to have historical data in another table. I would make foobar_history or something similar and make a FK to foobar_id. This will stop you from having to use a subquery altogether. It has the added advantage of not polluting your primary data table with the tons of historical data you probably don't want to see 99% of the time you're accessing it.
You will likely want to make a trigger for updating this data though, as it would require you to copy the current data into _history and then do the update.