structure of documents for versioning of a time se

2019-04-16 02:34发布

问题:

I have a time series that grows and is (potentially) revised through time:

on "2013-01-01": First version of the data

"2013-01-01" 10

on "2013-01-02": Data of the 1st of Jan is revised from 10 to 11

"2013-01-01" 11

on "2013-02-01": First version of the data of the 1st of Feb

"2013-01-01" 11
"2013-02-01" 20

on "2013-02-02": Data of the 1st of Feb is revised from 20 to 21

"2013-01-01" 11
"2013-02-01" 21

most frequent queries:

query1: get the most recent version of all dates

"2013-01-01" 11
"2013-02-01" 21

query2: get the time series as it was known at a certain date:

For instance, querying with "2013-02-01", I need to get
"2013-01-01" 11
"2013-02-01" 20

Note that query1 is a the same as query2 but with date = current date

I need help to structure my documents, and as I come from a relational background, I am not sure about the implications of my structure. I have basically identified 2 possible structure, and would be happy to have some feedbacks, or suggestions of other structure.

OPTION A: Each revision in a separate document

{
  "id":"1",
  "date":"2013-01-01",
  "version_date":"2013-01-01",
  "value":10
}

{
  "id":"1",
  "date":"2013-01-01",
  "version_date":"2013-01-02",
  "value":11
}

{
  "id":"1",
  "date":"2013-02-01",
  "version_date":"2013-02-01",
  "value":20
}

{
  "id":"1",
  "date":"2013-02-01",
  "version_date":"2013-02-02",
  "value":21
}

OPTION B: One document contains all the revisions of one date

{
  "id":"1",
  "date":"2013-01-01",
  "values" : [ 
              { "version_date":"2013-01-01",
                "value":10
              },
              {
                "version_date":"2013-01-02",
                "value":11
              }
}

{
  "id":"1",
  "date":"2013-02-01",
  "values" : [ 
              { "version_date":"2013-02-01",
                "value":20
              },
              {
                "version_date":"2013-02-02",
                "value":21
              }
}

In option B, I am also concerned by the fact that it might be a bit more difficult to perform the update query as the document has a growing part, which i am not sure is very well supported by / optimised for mongodb

EDIT: I am also considering option C to speed up query1: (although it might slow down a bit the writing)

{
  "id":"1",
  "date":"2013-01-01",
  "values" : [ 
              { "version_date":"2013-01-01",
                "value":10
              },
              {
                "version_date":"2013-01-02",
                "value":11
              }
  "last_value":11
}

{
  "id":"1",
  "date":"2013-02-01",
  "values" : [ 
              { "version_date":"2013-02-01",
                "value":20
              },
              {
                "version_date":"2013-02-02",
                "value":21
              }
  "last_value":21
}

回答1:

As with all questions like this, you are the only person who can answer this. If you have your data - try both way do some benchmarking on real data with real queries and compare what is better. If you do not have data - try to simulate it.

Keep in mind that with option B and C you have to be aware of 16 Mb limit per document. So if you have a lot of versions - you might reach the limit (but you have to understand that a there should be too many versions to reach 16Mb). Also keep in mind that updating such documents can and up with many moves on the disk.

Option B and C would be nice if you would need to select all revisions of a particular document at once, but I have not found this in your most often queries. Keep in mind that with right indexes you can achieve this as well with option A.



回答2:

There's actually a very recent blog post on the official page covering this topic: http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb Take a look at that and ask any additional questions if required.



回答3:

Considering the above mentioned Options, and your requirements, it would be best to create your structure based on date , like you mentioned in Option-B.Also it would be nice if your date is indexed. Some scenarios (easy reads,updates) that show why this seems to be the proper optimized solution are:

  1. Retrieving all the versions of a particular date.
  2. Retrieve all versions of a certain period of time(that is range, e.g from jan 2012 to feb 2012)
  3. Inserting a new version, you just have to use $push.
  4. Removing an old version , just with a simple query using $pull.