Should id or timestamp be used to determine the cr

2019-07-06 09:02发布

问题:

A database table is used to store editing changes to a text document.

The database table has four columns: {id, timestamp, user_id, text}

A new row is added to the table each time a user edits the document. The new row has an auto-incremented id, and a timestamp matching the time the data was saved.

To determine what editing changes a user made during a particular edit, the text from the row inserted in response to his or her edit is compared to the text in the previously inserted row.

To determine which row is the previously inserted row, either the id column or the timestamp column could be used. As far as I can see, each method has advantages and disadvantages.

Determining the creation order using id

  • Advantage: Immune to problems resulting from incorrectly set system clock.

  • Disadvantage: Seems to be an abuse of the id column since it prescribes meaning other than identity to the id column. An administrator might change the values of a set of ids for whatever reason (eg. during a data migration), since it ought not matter what the values are so long as they are unique. Then the creation order of rows could no longer be determined.

Determining the creation order using timestamp

  • Advantage: The id column is used for identity only, and the timestamp is used for time, as it ought to be.
  • Disadvantage: This method is only reliable if the system clock is known to have been correctly set each time a row was inserted into the table. How could one be convinced that the system clock was correctly set for each insert? And how could the state of the table be fixed if ever it was discovered that the system clock was incorrectly set for a not precisely known period in the past?

I seek a strong argument for choosing one method over the other, or a description of another method that is better than the two I am considering.

回答1:

Using the sequential id would be simpler as it's probably(?) a primary key and thus indexed and quicker to access. Given that you have user_id, you can quickly assertain the last and prior edits.

Using the timestamp is also applicable, but it's likely to be a longer entry, and we don't know if it's indexed at all, plus the potential for collisions. You rightly point out that system clocks can change... Whereas sequential id's cannot.

Given your update:

As it's difficult to see what your exact requirements are, I've included this as evidence of what a particular project required for 200K+ complex documents and millions of revisions.

From my own experience (building a fully auditable doc/profiling system) for an internal team of more than 60 full-time researchers. We ended up using both an id and a number of other fields (including timestamp) to provide audit-trailing and full versioning.

The system we built has more than 200 fields for each profile and thus versioning a document was far more complex than just storing a block of changed text/content for each one; Yet, each profile could be, edited, approved, rejected, rolled-back, published and even exported as either a PDF or other format as ONE document.

What we ended up doing (after a lot of strategy/planning) was to store sequential versions of the profile, but they were keyed primarily on an id field.

Timestamps

Timestamps were also captured as a secondary check and we made sure of keeping system clocks accurate (amongst a cluster of servers) through the use of cron scripts that checked the time-alignment regularly and corrected them where necessary. We also used Ntpd to prevent clock-drift.

Other captured data

Other data captured for each edit also included (but not limited to):

User_id
User_group
Action
Approval_id

There were also other tables that fulfilled internal requirements (including automatically generated annotations for the documents) - as some of the profile editing was done using data from bots (built using NER/machine learning/AI), but with approval being required by one of the team before edits/updates could be published.

An action log was also kept of all user actions, so that in the event of an audit, one could look at the actions of an individual user - even when they didn't have the permissions to perform such an action, it was still logged.

With regard to migration, I don't see it as a big problem, as you can easily preserve the id sequences in moving/dumping/transferring data. Perhaps the only issue being if you needed to merge datasets. You could always write a migration script in that event - so from a personal perspective I consider that disadvantage somewhat diminished.

It might be worth looking at the Stack Overflow table structures for there data explorer (which is reasonably sophisticated). You can see the table structure here: http://data.stackexchange.com/stackoverflow/query/new, which comes from a question on meta: How does SO store revisions?

As a revision system, SO works well and the markdown/revision functionality is probably a good example to pick over.



回答2:

Use Id. It's simple and works.

The only caveat is if you routinely add rows from a store-and-forward server so rows may be added later but should treated as being added earlier



回答3:

Or add another column whose sole purpose is to record the editing order. I suggest you do not use datetime for this.