What is the best way to store historical data in S

2020-06-08 03:28发布

My simplified and contrived example is the following:-

Lets say that I want to measure and store the temperature (and other values) of all the worlds' towns on a daily basis. I am looking for an optimal way of storing the data so that it is just as easy to get the current temperature in all the towns, as it is to get all the temperature historically in one town.

It is an easy enough problem to solve, but I am looking for the best solution.

The 2 main options I can think of are as follows:-

Option 1 - Same table stores current and historical records

Store all the current and archive records in the same table.

i.e.

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID [int] Identity(1,1) NOT Null,
  TownID [int] Not Null,
  Temp [int] NOT Null,
  Date [datetime] NOT Null,
)

This would keep everything simple, but what would be the most efficient query to get a list of towns and there current temperature? Would this scale once the table has millions of rows in? Is there anything to be gained by having some sort of IsCurrent flag in the table?

Option 2 - Store all archive records in a separate table

There would be a table to store the current live measurements in

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID [int] Identity(1,1) NOT Null,
  TownID [int] Not Null,
  Temp [int] NOT Null,
  Date [datetime] NOT Null,
)

And a table to store historical archived date (inserted by a trigger perhaps)

CREATE TABLE [dbo].[WeatherMeasurementHistory](
  MeasurementID [int] Identity(1,1) NOT Null,
  TownID [int] Not Null,
  Temp [int] NOT Null,
  Date [datetime] NOT Null,
)

This has the advantages of keeping the main current data lean, and very efficient to query, at the expense of making the schema more complex and inserting data more expensive.

Which is the best option? Are there better options I haven't mentioned?

NOTE: I have simplified the schema to help focus my question better, but assume there will be alot of data inserted each day (100,000s of records), and data is current for one day. The current data is just as likely to be queried as the historical.

8条回答
Bombasti
2楼-- · 2020-06-08 04:14

I would use a single table with index views to provide me with the latest information. SQL 2005 and 2008 server are designed for data warehousing so should preform well under this condition.

If you have a data pattern that requires writing to the db often, then the best choice would be to have an active table and archive table that you batch update at some interval.

查看更多
放荡不羁爱自由
3楼-- · 2020-06-08 04:21

Instead of trying to optimize relational databases for this, you might want to consider using a Time series database. These are already optimized for dealing with time-based data. Some of their advantages are:

  • Faster at querying time-based keys
  • Large data throughput
    • Since default operation is just an append, this can be done very quickly. (InfluxDb supports millions of data points per second).
  • Able to compress data more agressively
  • More user-friendly for time-series data.
    • The API's tend to reflect typical use-cases for time-series data
    • Aggregate metrics can be automatically calculated (e.g. windowed averages)
    • Specific visualization tools are often available.

Personally I liked using the open source database InfluxDB, but other good alternatives are available.

查看更多
登录 后发表回答