How do I model data that slowly changes over time?

Let's say I'm getting a large (2 million rows?) amount of data that's supposed to be static and unchanging. Supposed to be. And this data gets republished monthly. What methods are available to 1) be aware of what data points have changed from month to month and 2) consume the data given a point in time?

Solution 1) Naively save every snapshot of data, annotated by date. Diff awareness is handled by some in-house program, but consumption of the data by date is trivial. Cons, space requirements balloon by an order of magnitude.

Solution 2A) Using an in-house program, track when the diffs happen and store them in an EAV table, annotated by date. Space requirements are low, but consumption integrated with the original data becomes unwieldly.

Solution 2B) Using an in-house program, track when the diffs happen and store them in a sparsely filled table that looks much like the original table, filled only with the data that's changed and the date when changed. Cons, model is sparse and consumption integrated with the original data is non-trivial.

I guess, basically, how do I integrate the dimension of time into a relational database, keeping in mind both the viewing of the data and awareness of differences between time periods?

Does this relate to data warehousing at all?

Smells like... Slowly changing dimension?

标签： database sql-server-2005 time

6条回答

傲

2楼-- · 2019-05-12 07:17

I had a similar problem - big flat files imported to the database once per day. Most of the data is unchanging.

Add two extra columns to the table, starting_date and ending_date. The default value for ending_date should be sometime in the future.

To compare one file to the next, sort them both by the key columns, then read one row from each file.

If the keys are equal: compare the rest of the columns to see if the data has changed. If the row data is equal, the row is already in the database and there's nothing to do; if it's different, update the existing row in the database with an ending_date of today and insert a new row with a starting_date of today. Read a new row from both files.
If the key from the old file is smaller: the row was deleted. Update ending_date to today. Read a new row from the old file.
If the key from the new file is smaller: a row was inserted. Insert the row into the database with a starting_date of today. Read a new row from the new file.

Repeat until you've read everything from both files.

Now to query for the rows that were valid at any date, just select with a where clause test_date between start_date and end_date.

0人赞添加讨论(0) 举报

爱情/是我丢掉的垃圾

3楼-- · 2019-05-12 07:25

If it was me, I'd save the whole thing every month (not necessarily in a database, but as a data file or text file off-line) - you will be glad you did. Even at a row size of 4096 bytes (wild ass guess), you are only talking about 8G of disk per month. You can save a LOT of months on a 300G drive. I did something similar for years, when I was getting over 1G per day in downloads to a datawarehouse.

0人赞添加讨论(0) 举报

我只想做你的唯一

4楼-- · 2019-05-12 07:28

This sounds to me rather like the problem faced by source code version control systems. These store patches which are used to create the changes as they occur. So if a file does not change, or only a few lines change, the patch that needs to be stored is relatively very small. The system also stores which version each patch contributes to. So, when viewing a particular version of a particular file, the initial version is recovered and all the patches, up to the version requested are applied.

In your, very general, situation, you need to divide up your data into chunks. Hopefully there are natural divisions you can use, but if this division has to be arbitrary that's should be OK. Whenever a change occurs, store the patch for the affected chunk and record a new version. Now, when you want to view a particular date, find the last version that predates the view date, apply the patches for the chunk that has been requested, and display.

0人赞添加讨论(0) 举报

Luminary・发光体

5楼-- · 2019-05-12 07:28

Could you do the following:

1. Each month BCP all data into a temporary table
2. Run a script or stored procedure to update the primary table 
   (which has an additional  DateTime column as  part of a composite key), 
   with any changes made.
3. Repeat each month.

This should give you a table, which you can query data for at a particular date.

In addition each change will be logged, and the size of the table shouldn't change dramatically over time.

However, as a backup to this, I would store each data file as Brennan suggests.

0人赞添加讨论(0) 举报

疯言疯语

6楼-- · 2019-05-12 07:32

You could also take a leaf from the datawarehousing book. There are basically three ways of of dealing with changing data. Have a look at this wikipedia article for SCD's but it is in essence tables: http://en.wikipedia.org/wiki/Slowly_changing_dimension

0人赞添加讨论(0) 举报

劳资没心，怎么记你

7楼-- · 2019-05-12 07:36

A lot of this depends on how you're storing the data. There are two factors to consider:

How oftne does the data change?
How much does the data change?

The distinction is important. If it changes often but not much then annotated snapshots are going to be extremely inefficient. If it changes infrequently but a lot then they're a better solution.

It also depends on if you need to see what the data looked like at a specific point in time.

If you're using Oracle, for example, you can use flashback queries to see a consistent view of the data at some arbitrary point.

Personally I think you're better off storing it incrementally and, at a minimum, using some form of auditing to track changes so you can recover an historic snapshot if it's ever required. But like I said, this depends on many factors.

0人赞添加讨论(0) 举报

How do I model data that slowly changes over time?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间