How to design a database for User Defined Fields?

Posted 2019-01-05 06:43

My requirements are:

  • Need to be able to dynamically add User-Defined fields of any data type
  • Need to be able to query UDFs quickly
  • Need to be able to do calculations on UDFs based on data type
  • Need to be able to sort UDFs based on data type

Other Information:

  • I'm looking for performance primarily
  • There are a few million Master records which can have UDF data attached
  • When I last checked, there were over 50 million UDF records in our current database
  • Most of the time, a UDF is only attached to a few thousand of the Master records, not all of them
  • UDFs are not joined or used as keys. They're just data used for queries or reports

Options:

  1. Create a big table with StringValue1, StringValue2... IntValue1, IntValue2,... etc. I hate this idea, but will consider it if someone can tell me it is better than other ideas and why.

  2. Create a dynamic table which adds a new column on demand as needed. I also don't like this idea since I feel performance would be slow unless you indexed every column.

  3. Create a single table containing UDFName, UDFDataType, and Value. When a new UDF gets added, generate a View which pulls just that data and parses it into whatever type is specified. Items which don't meet the parsing criteria return NULL. (See the sketch after this list.)

  4. Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #3 and auto-generate a View anytime a new field gets added.

  5. XML DataTypes? I haven't worked with these before but have seen them mentioned. Not sure if they'd give me the results I want, especially with performance.

  6. Something else?
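
To make option #3 concrete, here is a rough sketch of the single-table layout and the kind of per-UDF view that would be generated. All names are illustrative only, and TRY_CAST is SQL Server syntax; other engines have equivalents:

    -- Option 3: one row per (Master record, UDF) pair, value stored as text
    CREATE TABLE UDFValue (
        MasterID    INT          NOT NULL,   -- points at the Master record
        UDFName     VARCHAR(100) NOT NULL,
        UDFDataType VARCHAR(20)  NOT NULL,   -- e.g. 'string', 'int', 'date'
        Value       VARCHAR(400) NULL,
        PRIMARY KEY (MasterID, UDFName)
    );

    -- Auto-generated view for one UDF; rows that fail to parse come back as NULL
    CREATE VIEW UDF_OrderDate AS
    SELECT MasterID,
           TRY_CAST(Value AS DATE) AS OrderDate
    FROM   UDFValue
    WHERE  UDFName = 'OrderDate';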

13 Answers
太酷不给撩
#2 · 2019-01-05 07:05

This is a problematic situation, and none of the solutions appears "right". However, option 1 is probably the best, both in terms of simplicity and in terms of performance.

This is also the solution used in some commercial enterprise applications.

EDIT

Another option, available now but not around (or at least not mature) when the question was originally asked, is to use JSON fields in the DB.

Many relational DBs now support JSON-based fields (which can hold a dynamic set of sub-fields) and allow querying on them, for example:

  • PostgreSQL
  • MySQL
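
A minimal sketch of the PostgreSQL flavour (a jsonb column plus a GIN index); the table and field names here are made up for illustration:

    -- add a jsonb column to the master table to hold the user-defined fields
    ALTER TABLE master ADD COLUMN udf jsonb NOT NULL DEFAULT '{}';

    -- GIN index keeps containment queries on the UDF data fast
    CREATE INDEX idx_master_udf ON master USING gin (udf);

    -- store a couple of user-defined values
    UPDATE master
    SET    udf = udf || '{"color": "red", "size": "large", "cost": 45.03}'::jsonb
    WHERE  id = 42;

    -- query: every row whose UDF "color" is 'red' (uses the GIN index)
    SELECT id FROM master WHERE udf @> '{"color": "red"}';

    -- sort / calculate by casting the extracted value
    SELECT id, (udf->>'cost')::numeric AS cost
    FROM   master
    ORDER BY cost DESC NULLS LAST;

MySQL's JSON type, its ->> extraction operator, and indexed generated columns give a comparable setup.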

看我几分像从前
#3 · 2019-01-05 07:10

If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is tailored specifically to the data distribution and access patterns you describe; a rough sketch of the layout appears at the end of this answer.

Pros:

  1. Because you indicate that some UDFs hold values for only a small portion of the overall data set, a separate table gives you the best performance: each table is only as large as it needs to be to support its UDF. The same holds true for the related indices.

  2. You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.

  3. You can use table/column names that reflect what the data actually is.

  4. You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.

  5. Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application level constraint enforcement.

Cons:

  1. This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.

  2. There is more application code needed to handle UDF definition and management. I expect this is still less code than the original options 1, 3, and 4 would need.

Other Considerations:

  1. If the nature of the data makes it sensible to group certain UDFs, that should be encouraged; those data elements can then be combined into a single table. For example, let's say you have UDFs for color, size, and cost. The tendency in the data is that most instances look like

     'red', 'large', 45.03 
    

    rather than

     NULL, 'medium', NULL
    

    In such a case, you won't incur a noticeable speed penalty by combining the three columns in one table, because few values would be NULL, and you avoid creating two more tables, which means two fewer joins whenever you need all three columns.

  2. If you hit a performance wall from a UDF that is heavily populated and frequently used, then that should be considered for inclusion in the master table.

  3. Logical table design can take you to a certain point, but when the record counts get truly massive, you should also start looking at what table partitioning options your RDBMS of choice provides.
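
To illustrate the table-per-UDF idea, here is a minimal sketch assuming a Master table with an integer key (all table, column, and UDF names below are invented):

    -- one narrow table per UDF, only as big as the number of Master rows
    -- that actually carry a value for it
    CREATE TABLE udf_cost (
        master_id INT           NOT NULL PRIMARY KEY REFERENCES master (id),
        value     DECIMAL(10,2) NOT NULL          -- real type, real constraints
    );

    CREATE TABLE udf_color (
        master_id INT         NOT NULL PRIMARY KEY REFERENCES master (id),
        value     VARCHAR(20) NOT NULL
    );

    -- filter/aggregate on the small UDF table first, then join back to Master
    SELECT m.id, m.name, c.value AS cost
    FROM   master m
    JOIN   udf_cost c ON c.master_id = m.id
    WHERE  c.value > 40.00;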

老娘就宠你
#4 · 2019-01-05 07:13

SharePoint uses option 1 and has reasonable performance.

Root(大扎)
#5 · 2019-01-05 07:15

This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.

They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.

I agree with Bill Karwin: the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but it only works well when the name-value pair makes a complete tuple of information. When using it forces you to dynamically reconstruct a table at run-time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance, or it forces you to push the tuple reconstruction up into the object layer.
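
To show what that pivot maintenance looks like, here is a sketch against an assumed EAV table (udf_value, with master_id/udf_name/value columns); every newly defined UDF means editing the query:

    -- reassembling a "row" from an EAV table: one MAX(CASE ...) per attribute
    SELECT master_id,
           MAX(CASE WHEN udf_name = 'color' THEN value END) AS color,
           MAX(CASE WHEN udf_name = 'size'  THEN value END) AS size,
           MAX(CASE WHEN udf_name = 'cost'  THEN value END) AS cost
    FROM   udf_value
    GROUP BY master_id;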

You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.

You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200 characters? Should it be nvarchar instead? It can be a hard trade-off, and one that ends with you placing artificial limits on the dynamic nature of your set: something like "you can only have x user-defined fields, and each can only be y characters long."

With a document-oriented solution, like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue, life is happy, as neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow) at lengths that don't get hard to manage until you reach about 4MB.

If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.

爷、活的狠高调
#6 · 2019-01-05 07:17

Our database powers a SaaS app (helpdesk software) where users have over 7k "custom fields". We use a combined approach:

  1. (EntityID, FieldID, Value) table for searching the data
  2. a JSON field in the entities table that holds all entity values, used for displaying the data (this way you don't need a million JOINs to get the values).

You could further split #1 to have a "table per datatype" like this answer suggests; that way you can even index your UDFs (a rough sketch of the combined layout follows).
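
A rough sketch of that combined layout (every name below is invented for illustration):

    -- 1) narrow EAV table used for searching and filtering
    CREATE TABLE entity_field_value (
        entity_id INT          NOT NULL,
        field_id  INT          NOT NULL,
        value     VARCHAR(400) NULL,
        PRIMARY KEY (entity_id, field_id)
    );
    CREATE INDEX idx_efv_field_value ON entity_field_value (field_id, value);

    -- 2) denormalized JSON copy on the entity row, read only when displaying,
    --    so no joins are needed to show all of an entity's custom fields
    ALTER TABLE entity ADD COLUMN custom_fields JSON;
    -- e.g. {"priority": "high", "sla_hours": 4, "region": "EMEA"}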

P.S. A couple of words in defense of the "Entity-Attribute-Value" approach everyone keeps bashing. We used #1 without #2 for decades and it worked just fine. Sometimes it's a business decision: do you have the time to rewrite your app and redesign the DB, or can you throw a couple of bucks at cloud servers, which are really cheap these days? By the way, when we were using the #1 approach alone, our DB held millions of entities, accessed by hundreds of thousands of users, and a 16GB dual-core DB server was doing just fine (really an "r3" VM on AWS).

贪生不怕死
#7 · 2019-01-05 07:18

If you're using SQL Server, don't overlook the sql_variant type. It's pretty fast and should do your job. Other databases might have something similar.
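
For example, a minimal sketch of an sql_variant-based value table (the table and column names are assumptions, not from the question):

    -- SQL Server: one value column that can hold any of the base scalar types
    CREATE TABLE UDFVariant (
        MasterID INT           NOT NULL,
        UDFName  NVARCHAR(100) NOT NULL,
        Value    SQL_VARIANT   NULL,
        PRIMARY KEY (MasterID, UDFName)
    );

    -- the underlying type travels with the value and can be inspected
    SELECT MasterID,
           Value,
           SQL_VARIANT_PROPERTY(Value, 'BaseType') AS BaseType
    FROM   UDFVariant
    WHERE  UDFName = N'Cost';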

XML data types are not so good for performance reasons. If you're doing calculations on the server then you're constantly having to deserialize them.

Option 1 sounds bad and looks cruddy, but performance-wise it can be your best bet. I have created tables with columns named Field00-Field99 before because you just can't beat the performance. You might need to consider your INSERT performance too, in which case this is also the one to go for. You can always create Views on this table if you want it to look neat!
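
For instance, a view can give the generic columns meaningful names for a particular class of record (the column mapping below is invented for illustration):

    -- dress up the wide Field00..Field99 table for one use case
    CREATE VIEW vw_product_udf AS
    SELECT MasterID,
           Field00 AS Color,
           Field01 AS Size,
           Field02 AS Cost
    FROM   master_udf_wide;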
