one table or many for many different but interacti

2020-06-05 06:20发布

问题:

I'm creating an application that as its core function tracks various data over time (blood glucose levels, insulin dosages, food intake etc) and I'm trying to decide how best to organize this information in a database.

At its most basic everything within this particular umbrella is an event, so I thought of having a single Events table with fields for all the properties that might come up. This may be unwieldy, though, because the vast majority of the fields will end up being blank for many of the entires; but i'm not sure if that's actually a problem. The benefit of that way is that it will be easier to call and display all the events. But since many of the events will only have 'timestamp' in common, i question whether they belong on the same table.

I'm not sure that it makes sense to have a table for every kind of event, because taken separately most of the events have only one property other than timestamp, and they will often have to co-mingle. (many types of data will often but not always come in a group)

some types of events have durations. some are comparatively very rare. One class of events are normally a rate that stays the same unless the rate is altered for good or with a temporary override (those are the ones i'm most worried about). Some are simple binary tags (which i was planning on having a linking table for, but to make that easy I would need/prefer an overall event_id to link them with.

My inclination is that is best to have a few tables with closely related types of information rather than one table with everything and a whole lot of space.. but am not quite sure how to proceed.

I would love some advice on strategy for determining the best approach in a situation like this.

edit: Here is a rundown of the data types I'm dealing with in case it makes things more clear

events:
-blood glucose 
     timestamp
     value 
     (tagged w/: from pump, manually entered
     [pre-meal, post-meal (breakfast, lunch, dinner) before bed, fasting, hypo, high, hyper  - which will be either manually entered or inferred based on settings or other user entries], before/after exercise etc i imagine would be better off dynamically generated with queries as necessary. though could apply same paradigm to the meals?

-sensor glucose (must be separate bc it is not as reliable so will be different number from regular bg test, also unlikely to be used by majority of users.)
     timestamp
     amount

-bolus 
     (timestamp)
     bolus total
     food total
     correction total 
     active insulin**
     bolus type - normal[vast majority] square wave or dual wave

-food
     (timestamp)
     carb amount
     carb type (by weight or exchanges) <- this could probably be in user settings table
     food-description
     carb-estimated (binary) 
     meal? - or separate table.
     (accompanying bolus id? though that seems to finicky)

-meals
     timestamp
     mealname (breakfast, lunch, supper) (or mealnames table? seems excessive?)

-basal
     timestamp
     rate per hour
     rate changes throughout day on regular pattern, so either automatically fill in from 'last activated pattern' (in the form midnight: 0.7/hr, 7am: 0.9/hr, 12pm: 0.8/hr etc)
     create new pattern whenever one is used

-temp basal
     (regular basal pattern can be overridden with temporary basal)
     temp basal start
     ?temp basal end and/or temp basal duration
     temp basal amount
     temp basal type -> either in % or specific rate.

-exercise
     start-time
     end-time
     intensity
     ?description (unless 'notes' is universal for any event)

-pump rewind (every 3 days or so)
     -time

-pump prime
     -amount
     -type (fixed or manual)

-pump suspended
     start-time
     end-time

-keytones
     time
     result

-starred
     event

-flagged
     event

-notes
     timestamp
     (user can place a note with any event to provide details or comments, but might want a note where there is no data as well.)

(i want a way for users to flag specific events to indicate they are result of error or otherwise suspect, and to star events as noteworthy either to discuss with doctor or to look at later)

**only place I get active insulin from is when a bolus is entered, but it could be useful other times as a constantly tracked variable, which could be calculated by looking at boluses delivered up to X time ago where X is the Active Insulin Time.

other infrequent events (likely 2-10 per year):
-HbA1C 
     time
     value
-weight
     time
     value
     units
-cholesterol
     time
     value
-blood pressure
     time
     value

-pump settings (will need to track settings changes, but should be able to do that with queries)
     -timestamp
     -bg-target
     -active insulin time
     -carb ratios (changes throughout day like basal)
     -sensitivity
     -active insulin time

concerns. 1) overarching 'events' table with a type, to quickly bring back all events in period of time without having to query every single table? (disadvantage is how do I work with events with duration? have optional end-time on event table?)

2) this is a local database which will generally be one user, and there will never be a need to compare or interact any of the records of other users if it is synced online, so i was thinking of just keeping one version of the database per user, though maybe adding a 'user' id as it is uploaded.

3) many of the events will often go together for ease of interpretation and analysis (blood sugar, meal, food, bolus, notes for instance), i gather it's better to do that after the fact with queries rather than hardcoding anything to maintain integrity.

Some info on what the database will be used for: -A visual representation of all data types over the course of a day -Average all test results and percent of insulin which is used for food, correction, basal. -As well as specific advanced queries like: list up to 20 examples of the difference in glucose level between before bed glucose and morning glucose when no food eaten and no exercise w/in 2 hours of bed, since settings were last changed, etc. -program will automatically assign tags based on parameters. like if >20 carbs are eaten during assigned 'lunch' period, it will say that food is lunch. if there are two food intakes within 30 minutes (or 'meal length' preference) it will group them as one meal.. not totally sure how that will function right now.

回答1:

V1.0

Relational Databases, and SQL (which was designed for them) perform much better when the data is organised and Normalised. The one big table is un-normalised, and crippled, in terms of both performance and Relational power.

Your requirement calls for an ordinary Supertype-Subtype cluster of tables. Unfortunately ordinary Relational structures such as this are not "common".

  • The Standard Subtype symbol is the semicircle.

    • The cardinality of the Supertype::Subtype is always 1::0-to-1.

    • The Subtype Primary Key is the Supertype Primary Key. It is also the Foreign Key to the Supertype.

  • There are two types:

    • Exclusive, where there is only one Subtype for each Supertype row, denoted with an X through the semicircle.

    • Non-exclusive, where there is more than one Subtype per Supertype row

  • Yours is Exclusive. This type needs a Discriminator, to identify which Subtype is active for the Supertype row. Where the number of Subtypes is small, Indicators can be used; otherwise a classification table is required.

  • Note that all this, the structures, the rules, the constraints, that are required to support it, and to provide Data Integrity, is available in ordinary IEC/ISO/ANSI SQL. (The Non-SQLs do not comply with the SQL requirement).

Data

  1. Naming is very important. We are advised to name the table by the row, not the content or meaning or action. You speak of Events, but I can only see Readings.

  2. There must be a context for these Readings or Events. I do not see how an EventId hangs in the air. I have assumed that the Readings are about a particular Patient. Please advise, and I will change the model.

  3. Composite or Compound Keys are normal. SQL is quite capable (the Non-SQLs aren't). PatientId already exists as an FK in Reading, and it is used to form its PK. There is no need for an additional ReadingId column and the additional index, which would be 100% redundant.

  4. SQL is also quite capable of handling many tables (the database I am working on currently exceeds 500 tables), and large numbers of smaller tables are the nature of Relational Databases.

  5. This is pure Fifth Normal Form (no columns duplicated; no Update Anomalies).

    • This can be further Normalised to Sixth Normal Form, and thus further benefits can be gained; and the 6NF can be optimised, etc.; but all that is not required here.

    • Some tables happen to be in 6NF, but that is a consequence, not an intent, so it cannot be declared as such.
      .

  6. If you provide information about the limits and overrides that concern you, I can provide a model that resolves those issues.

  7. Since the data is modelled, it is already set up for very fast comparisons (generating alarms, etc.).

▶Reading Data Model◀

Readers who are unfamiliar with the Standard for Modelling Relational Databases may find ▶IDEF1X Notational◀ useful.

Feel free to ask clarifying questions, either as comments, or as Edits to your question.

Caveat

  1. The OO and ORM crowd (lead by Fowler and Ambler) are clueless about Relational technology and Databases. Designing Objects is quite different to modelling data. If you apply their Object design to databases, you will end up with monstrosities that need "re-factoring", and you will have to buy yet another "book" that shows you how to do that efficiently. In the meantime the "database" is crippled.

  2. Relational Databases that are modelled correctly (as data, not objects) never need "re-factoring". In highly Normalised Databases, you can add tables, columns and functions without having to change existing data or code.

  3. Even the concept of ORM is totally flawed. Data has more permanence than Objects. If you model the data first, then model your Objects for the data, it is very stable. But if you model your objects first (which is weird anyway, without an understanding of the data), then model the data after the Objects, you will be going back and forth, constantly correctly both.

  4. Relational Databases have had perfectly ordinary structures such as Supertype-Subtype for over 30 years, and they work well, if they are implemented as that. They are not "gen-spec" or "class-inheritance" or any such OO thing; and if those OO or ORM structures are implemented, without modelling the data correctly, the "database" will be crippled, and need "we-factoring".

    • Additionally, they do not implement the required Data Integrity constraints, so usually data quality is poor. We do not allow bad data to enter the Database; their "databases" are full of bad data, and they need another "book" on how to wash dirty data.
      .
  5. They have the sequence, and the hierarchy, mixed up. Done correctly, there is no "impedance mismatch", no pseudo-technical names to mask pure stupidity; to justify doing the same set of work over and over and over again.

So run like hell from anyone using OO or ORM terminology when dealing with Relational Databases.

V1.1

Your Edit provides far more detail, which of course is demanded, because the context, the whole, is necessary, if data is to be modelled correctly. This incorporates all that info. However, questions remain, and some back-and-forth will be required before it can be complete. Feel free to ask questions about anything that is not absolutely clear; I am not sure exactly what the gap is until I throw something up, and you speak to it.

▶Event Data Model V1.1◀

  1. All my models are pure Relational (retain full Relational power), IDEF1X compliant and Fifth Normal Form (no Update Anomalies). All Rules (business or data/referential Integrity) that are drawn in the model can be implemented as Declaration in ISO/IEC/ANSI SQL.

  2. Never hard-code anything. My models do not require that, and any code working with the database does not have to do that. All fixed text is Normalised into Reference or Look-up tables. (that bit is incomplete,; you need to fill in the gaps).

    • A short alphabetic code is far better than an Enumeration; once you get used to it, the values and meanings become immediately recognisable.

    • Because they are PKs, and therefore stable, you can safely code:

      ... WHERE EventTypeCode = "P"
      or
      ... WHERE EventTypeCode LIKE "T%"

  3. I believe the DataTypes are self-evident or can be worked out easily. If not, please ask.

  4. Everything that your note as "finicky" is perfectly valid. The issue is, since you have not had a database to engage with, you did not know what should be in the database vs what should be or can be SQL code. Therefore all the "finicky" items have been provided for (the database elements), you need to construct the code. Again, if there is a gap please ask.

    • What I am saying is, working in the traditional style of I am the Data Modeller, you are the Developer, you have to ensure every item from your perspective is delivered, rather than relying on me to interpret your notes. I will be delivering a database that supports all the requirements that I can glean from your notes.
      .
  5. One Patient per Database. Let's allow for the possibility that your system is successful, in the future, you will have one central workhorse database, rather than limiting it one database per patient, which would be a nightmare to administer. Let's say that you need to keep all your Patient details in one place, one version of the truth. That is what I have provided. This does not limit you in the short term, from implementing one Db per patient; there is no problem at all with only one row in the Patient table.

    • Alternately, I can strip PatientId out of al the tables, and when you grow into a central database configuration, you will require a major database upgrade.

    • Likewise, if you have Sensors or Pumps that you need to track, please identify their attributes. Any Sensor or Pump attributes would then be Normalised into those tables. If they are "one per patient" that's fine, there will be one row in those tables, unless you need to store the history of Sensors or Pumps.

  6. In V1.0 the Subtypes were Exclusive. Now they are Non-exclusive. This means we are tracking a chronology of Events, without duplication; and any single Event may consist of more than one Subtype. Eg. Notes can be inserted for any Event.

    • Before completion, the EventType list provided needs to be filed out in the form of a grid, showing (a) permitted (b) mandatory Subtypes per EventType. Thate will be implemented as CHECK Constraints in Event.
      .
  7. Naming is very important. I am using ISO standard 11179 (guidelines and principles) plus my own conventions. Reading type Events are prefixed as such. Feel free to suggest changes.

  8. Units. Traditionally, we use either Metric xor US Imperial across the database, allow entry in whatever the user likes, and convert before storage. If you need a mixture, then at least we should have the UnitType specified at the Patient or Pump level, rather than allowing storage of either UnitType. If you really need either UnitType stored, changing back and forth, then yes, we need to store UnitType with each such Value.

  9. Temporal Database. You have Times Series being recorded, and well as interpreted via SQL. Big subject, so read up on it. The minimum I would ask you to read and understand is:

    ▶Temporal Database Performance (0NF vs 5NF)◀

    ▶Classic 5NF Temporal Database◀ (Inspect the Data Model carefully)

  10. Basically the issue boils down to this:

    • Either you have a true 5NF database, no data duplication, no Update Anomalies.

      • That means, for continuous time series, only the StartDateTime is recorded. The EndDtateTime is easily derived from the StartDateTime of the next row, it is not stored. Eg. Event is a continuos chronology; the EventType identifies whether the Event is a specific DateTime or a Period/Duration.

      • EndDateTime is stored only for disjoint Periods, where there are legitimate gaps between Periods; in any case it is clearly identified via the EventType. Eg. Exercise, PumpSuspended. (Incidentally, I am suggesting the patient only knows the actual, as opposed to planned, attributes, at the end of the Exercise period.)

      • Since generally there is no EndDateTime, StartDateTime is simply DateTime. Eg. EventDtm

      • This requires the use of ordinary SQL Subqueries. This is actually quite simple once the coder has a grasp on the subject. For those who don't, I have supplied a full tutorial on Subqueries in general, and using them in a Temporal context in particular, in:

      ▶It Is Easy When You Know How◀. Not coincidentally, re the very same Classic 5NF Temporal Database above.

    • XOR you have a database with EndDateTime stored (100% duplication) with every StartDateTime column, and you can use flat, slow queries. Lots of manipulating large result sets with GROUP BYs, instead of small result sets. Massive data duplication and Update Anomalies have been introduced, reducing the database to a flat file, to supply the needs of coders with limited ability (certainly not "ease of coding").

    • Therefore, consider carefully and choose, for the long term only, because this affects every code segment accessing temporal data. You do not want a re-write halfway down the track when you realise that maintaining Update Anomalies is worse than writing Subqueries.

      • Of course, I will provide the explicit requirements to support a 5NF Temporal Database, correct DataTypes, etc., to support all your identified requirements.

      • Further, if you choose 0NF, I will provide those fields, so that the Data Model is complete for your purpose.

      • In either case, you need to work out exactly the SQL code required for any given query.

  11. DataType handling is important. Do not store Time (hours, etc) as Integer or an offset. Store it only as TIME or DATETIME Datatype. If an offset, store it as Time since midnight. That will allow unrestricted SQL, and Date Arithmetic functions.

  12. Task for you. Go through the model carefully, and ensure that:

    • every non-key Attribute has a 1::1 relationship with its Primary Key

    • and that it does not have a relationship to any other PK (in some other table)

    And of course, check the Model and provide feedback.

Question

Given the above explanations and guidance.

  • What is ReadingBasalTemperature.Type, list values please ?

  • What is HbA1C ?

  • What is KeyTone ?

  • Do we need (ie. Duration/Period EndDateTime`):

    • ReadingBasalTemperatureEnd
    • ReadingBolusEnd
    • Basal Pattern
    • BasalTemp Pattern
    • Actually, what is a pattern, and how is it derived/compared ?
  • How is BasalTemperatureEnd (Or Duration) determined

  • Starting position is, there is no need to store Active Insulin Duration. But you need to define how the EndDateTime is determined. Based on that, if it cannot be easily derived, and or it based on too many factors or changes all the time, storing an EndDateTime might be good.

  • The Pump Settings need clarification.

V1.2

Ok, I have incorporated all information you have proved in the question and the comments. Here is a progressed Data Model.

▶Event Data Model V1.2◀

There are still some issues to be resolved.

  • Use a Percentage or a Rate only, not both with an additional indicator. One can be derived from the other. I am using Rate consistently.

  • ... the only worry about the approach is that for many days the basal rate will be identical.. hence redundancy

    • That is not "redundancy". That is storage of a time series of facts, which happen to be unchanging. The queries required are straight-forward.

    • However, in advanced use, yes, you can avoid storing an unchanged fact, and instead extend the duration to include the new time interval.

  • I am still not clear re your explanation of Basal Temp. Please study the new Model. First, the patterns are now stored separately. Second, we are recording a Basal Temp Start with a Rate. Do we need a Basal Temp End (with a Rate) ?

  • "GlucoseEventType would be able to have more than one value per Glucose Result" needs more definition. Don't worry about ID keys. Just tell me about the data. For each ReadingGlucoseBlood, name the result values, and which GlucoseEventType they apply to; which are mandatory and which are optional.

  • PumpHistory.InsulinEndDateTime is the ending Instant for the Duration. Of course that is generic, the starting Instant is whatever row you compare it to. Thus it should be seconds or minutes since midnight 01 Jan 1900.

  • Check the new Event PK. Where the incoming record identifies several Events, you need to parse that, and INSERT each Event-EventSubtype row, using the same DateTime.

  • Except for Patient, there are no ID keys in this database, none are required thus far. Refer to the parent by full PK.

05 Feb 11

No feedback received re V1.2.

a lot of the data i'm getting is being pulled from an external (and somewhat disorganized) csv which groups certain event types under one row and often has events on the same second, which is as granular as it gets

That is easy to overcome. However, that means that an Instant is not an Instant. Now, I could walk you through the whole exercise, but the bottom line is simple.

  • If you really need it, we could add a SequenceNo to the PK, to make it unique. But I suspect the EventTypeCode is enough (there will not be more than one EventType per second). If not, let me know, and I will change the moel.

  • Retain the meaning of an Instant as an Instant, and thus avoid departing from the architectural requirements of Temporal Databases.

  • Use EventType to afford uniqueness to the DateTime Pk.

    • Keep in mind that the EventTypeCode is deployed in the Event PK, not as a Discriminator requirement, but to afford uniqueness. Thus its presence in the PK of the Subtypes is an artefact, not that of a Discriminator (which is already known by virtue of the Subtype).
  • However there is unnecessary complexity due to the Non-exclusive Subtype (there can be more than one Subtype per Supertype row).

  • Therefore I have changed it back to an Exclusive Subtype, deterministic. One EventType per Supertype row; max one Subtype.

Refer to Implementing Referential Integrity for Subtypes for specific information re Constraints, etc.

The change to the Data Model is too small to warrant another release. I have updated the V1.2 Data Model.

06 Mar 11

Due upholding the "above all, be technically honest" guideline in the FAQ, and confronting misinformation as requested therein, I was suspended for my efforts (which means I will no longer correct misinformation on SO, and such posters have protected reign). Interaction with the seeker was carried on, to completion, and the Final Data Model was completed, away from SO. The progression is therefore lost to SO readers. However, it may be of value to post the result, the ▶Final Data Model V1.16◀.

  • Events always have a starting Instant (Event.DateTime).
  • Events may be Durations, in which case an ending Instant (Event) is required.
  • Some Events consist of only the Supertype; others require a Subtype. This is identified in third column of the EventType exposition.
  • The fourth column identifies the type of Event:
    • Instant or Duration
    • Duration: Conjunct or Disjunct
  • Note that the resolution of DateTime on the seeker's platform is one second, and many Events may occur in one second, but not more than one of the same EventType. EventTypeCode has therefore been included in the Event Primary Key to implement that rule. Thus it is an artefact, it is not a generic requirement for a supertype-subtype structure or for Exclusive/Non-exclusive subtypes.
  • Intended for printing on two facing US Letter pages, enlarged or not, with a gusset.


回答2:

No, the second you say something like "the vast majority of the fields will end up being blank for many of the entries", your schema is almost certainly broken.

Blood glucose should be in its own table with nothing more than a date/time and level. Food intake similarly, if you're just storing the carb content. If you're storing the individual components of the meal, you should have a one to many relationship between the food-intake table and the food-components table.

Same as the insulin, with a timestamp and quantity.

The right way to represent the relationships between disparate objects is with, well, relationships: foreign keys and such.


And, as a "nothing-to-do-with-the-programming-side-of-it" aside (on the co-mingling of events), if you're taking your insulin and having your meal at the same time, you're probably doing it wrong. Most fast-acting insulin takes half an hour to start doing a proper job and the food entering your gut will start the production of sugars in about five to ten minutes. So there should be about a half-hour delay between shooting up and eating.

The only usual "same-timestamp-or-thereabouts" events that I'm aware of would be the blood sugar level immediately before the meal to ensure the insulin you injected half an hour ago has started working.

But I'm not your endocrinologist and I don't even play one on TV :-) So check with your medical bods first.




回答3:

"My inclination is that is best to have a few tables with closely related types of information rather than one table with everything and a whole lot of space.. but am not quite sure how to proceed."

Simple and best. To see why, we can examine the alternatives.

One table per metric. Fully normalized but incredibly fatiguing. I'm normally a normalization hawk, and also I normally say 'tables are free', this is not quite true when they start to number in the dozens or hundreds for like things. So we'd prefer something simpler.

At the other end we have E-A-V, one table, with values like test_id, metric_id, value. This is well known to be nigh-on impossible to query and work with. Like the venus flytrap, draws you in with sweet smelling nectar, then closes in on you and eats you.

On the gripping hand, there is the one large table with all possible columns. This is called the 'sparse' solution for obvious reasons. I did this in Direct Marketing and it works well, but that was a highly specialized situation, this approach is not generally recommended.

So somewhere in the middle is a handful of tables, one table for each set of tests with similar values to store.

Hmmmm, which is just what you suggested. Sounds good!



回答4:

Take a look at these SO examples: one, two, three, four.