Storing changes on entities: Is MySQL the proper s

Published 2019-06-15 14:42

混吃等死

Question:

I want to store the changes that I make to my "entity" table. This should work like a log. Currently it is implemented with this table in MySQL:

CREATE TABLE `entitychange` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `entity_id` int(10) unsigned NOT NULL,
  `entitytype` enum('STRING_1','STRING_2','SOMEBOOL','SOMEDOUBLE','SOMETIMESTAMP') NOT NULL DEFAULT 'STRING_1',
  `when` TIMESTAMP NOT NULL,
  `value` TEXT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

entity_id = the primary key of my entity table.
entitytype = the field that was changed in the entity table. Sometimes only one field is changed, sometimes several. One change = one row.
value = the string representation of the "new value" of the field.

For example, when changing the field entity.somedouble from 3 to 2, I run these queries:

UPDATE entity SET somedouble = 2 WHERE entity_id = 123;
INSERT INTO entitychange (entity_id,entitytype,value) VALUES (123,'SOMEDOUBLE',2);

I need to select the changes of a specific entity and entitytype of the last 15 days. For example: The last changes with SOMEDOUBLE for entity_id 123 within the last 15 days.

Now, there are two things that I dislike:

All data is stored as TEXT, although most of it isn't really text (less than 1% is). In my case, most values are DOUBLE. Is this a big problem?
The table is getting really, really slow on inserts, since it already has 200 million rows. Currently my server load is up to 10-15 because of this.

My question: How do I address those two "bottlenecks"? I need to scale.

My approaches would be:

Store it like this: http://sqlfiddle.com/#!2/df9d0 (click on browse) - Store the changes in the entitychange table and then store the value according to its datatype in entitychange_[bool|timestamp|double|string]
Use partitioning by HASH(entity_id) - I thought of ~50 partitions.
Should I use another database system, maybe MongoDB?

Answer 1:

If I were facing the problem you describe, I would design the LOG table as below:

EntityName: (String) Entity that is being manipulated. (mandatory)
ObjectId: Primary key of the entity that is being manipulated.
FieldName: (String) Entity field name.
OldValue: (String) Entity field old value.
NewValue: (String) Entity field new value.
UserCode: Application user unique identifier. (mandatory)
TransactionCode: Any operation that changes entities needs a unique transaction code (such as a GUID). (mandatory)
When an update changes multiple fields of an entity, this column is the key to tracing all changes made in that update (transaction).
ChangeDate: Transaction date. (mandatory)
FieldType: Enumeration or text showing the field type, such as TEXT or DOUBLE. (mandatory)

With this approach:
Any entity (table) can be traced.
Reports will be readable.
Only changes will be logged.
The transaction code is the key to detecting all changes made by a single action.
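
As a rough illustration of the column list above, the log table could be declared along these lines. This is only a sketch: the table name change_log, the snake_case column names, every type and length, the two indexes, and the sample user code and GUID are assumptions that do not come from the answer.

```sql
-- Hypothetical DDL for the LOG table described above.
-- Names, types, lengths and indexes are assumptions for illustration.
CREATE TABLE change_log (
  entity_name      VARCHAR(64)  NOT NULL,  -- entity (table) being manipulated
  object_id        INT UNSIGNED NOT NULL,  -- primary key of the changed row
  field_name       VARCHAR(64)  NOT NULL,
  old_value        VARCHAR(255) NULL,
  new_value        VARCHAR(255) NULL,
  user_code        VARCHAR(32)  NOT NULL,
  transaction_code CHAR(36)     NOT NULL,  -- one GUID shared by all rows of one action
  change_date      TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
  field_type       ENUM('TEXT','DOUBLE','BOOL','TIMESTAMP') NOT NULL,
  KEY idx_object (entity_name, object_id, change_date),
  KEY idx_tx (transaction_code)
) ENGINE=InnoDB;

-- One update that touches two fields: both rows carry the same transaction code.
INSERT INTO change_log
  (entity_name, object_id, field_name, old_value, new_value,
   user_code, transaction_code, field_type)
VALUES
  ('entity', 123, 'somedouble', '3', '2', 'user42',
   '3f2504e0-4f89-11d3-9a0c-0305e82c3301', 'DOUBLE'),
  ('entity', 123, 'somebool', '0', '1', 'user42',
   '3f2504e0-4f89-11d3-9a0c-0305e82c3301', 'BOOL');

-- Everything changed by that single action.
SELECT field_name, old_value, new_value
FROM change_log
WHERE transaction_code = '3f2504e0-4f89-11d3-9a0c-0305e82c3301';
```

In practice the application would generate the GUID once per action and reuse it for every row it logs.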

BTW

Store the changes in the entitychange table and then store the value 
according to its datatype in entitychange_[bool|timestamp|double|string]

This won't be needed: the single table holds both the changes and their data types.

Use partitioning by HASH(entity_id)

I would prefer partitioning by ChangeDate, or creating backup tables for rows whose ChangeDate is old enough to be backed up and removed from the main LOG table.

Should I use another database system, maybe MongoDB?

Any database comes with its own pros and cons; you can use this design on any RDBMS. A useful comparison of document-based databases like MongoDB can be found here

Hope this helps.

Answer 2:

Now I think I understand what you need: a versionable table with a history of the changed records. This could be another way of achieving the same result, and you could easily run some quick tests to see whether it performs better than your current solution. It's the way the Symfony PHP framework does it in Doctrine with the Versionable plugin.
Keep in mind that the primary key is a unique index over two columns, version and fk_entity.
Also take a look at the values saved. You save a 0 in the fields that didn't change and the new value in those that did.

CREATE TABLE `entity_versionable` (
  `version` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `fk_entity` INT(10) UNSIGNED NOT NULL,
  `str1` VARCHAR(255),
  `str2` VARCHAR(255),
  `bool1` BOOLEAN,
  `double1` DOUBLE,
  `date` TIMESTAMP NOT NULL,
  PRIMARY KEY (`version`,`fk_entity`)
) ENGINE=INNODB DEFAULT CHARSET=latin1;


INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, 'a1', '0', 0, 0, '2013-06-02 17:13:16');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, 'a2', '0', 0, 0, '2013-06-11 17:13:12');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, '0', 'b1', 0, 0, '2013-06-11 17:13:21');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, '0', 'b2', 0, 0, '2013-06-11 17:13:42');
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, '0', '0', 1, 0, '2013-06-16 17:19:31');

/* Another example */
INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, `date`)
VALUES (1, 'a1', 'b1', 0, 0, CURRENT_TIMESTAMP);


SELECT * FROM `entity_versionable` t
WHERE t.`fk_entity` = 1
  AND t.`date` >= (CURDATE() - INTERVAL 15 DAY);

Another step that would probably improve performance is to move the history log records into separate tables, once per month or so. That way you won't have many records in each table, and searching by date will be really fast.

Answer 3:

There are three main challenges here:

1. How to store data efficiently, i.e. taking less space and being in an easy-to-use format

2. Managing a big table: archiving, ease of backup and restore

3. Performance optimisation: faster inserts and selects

Storing data efficiently

value field. I would suggest making it VARCHAR(N). Reasons:
- Using N ≤ 255 saves 1 byte per row compared with TEXT, because the length prefix shrinks from 2 bytes to 1.
- Using other data types for this field: fixed types use space whatever the value is, normally 8 bytes per row (datetime, long integer, char(8)), and other variable-length data types are too big for this field.
- Also, the TEXT data type results in performance penalties (from the manual on BLOB and TEXT data types):

Instances of TEXT columns in the result of a query that is processed using a temporary table causes the server to use a table on disk rather than in memory because the MEMORY storage engine does not support those data types. Use of disk incurs a performance penalty, so include BLOB or TEXT columns in the query result only if they are really needed. For example, avoid using SELECT *, which selects all columns.

Each BLOB or TEXT value is represented internally by a separately allocated object. This is in contrast to all other data types, for which storage is allocated once per column when the table is opened.

Basically, TEXT is designed to store big strings and pieces of text, whereas VARCHAR() is designed for relatively short strings.

id field. (updated, thanks to @steve) I agree that this field does not carry any useful information. Use 3 columns for your primary key: entity_id, entitytype and when. The TIMESTAMP will guarantee pretty well that there are no duplicates. The same columns will also be used for partitioning/sub-partitioning.

Table manageability

There are two main options: MERGE tables and partitioning. The MERGE storage engine is based on MyISAM, which is being gradually phased out as far as I understand. Here is some reading on the MERGE Storage Engine.

The main tool is partitioning, and it provides two main benefits:
1. Partition switching (often an instant operation on a large chunk of data) and the rolling-window scenario: insert new data into one table and then instantly switch all of it into an archive table.
2. Storing data in sorted order, which enables partition pruning: querying only those partitions that contain the needed data. MySQL allows sub-partitioning to group data further.

Partitioning by entity_id makes sense. If you need to query data for extended periods of time, or you have some other pattern in querying your table, use that column for sub-partitioning. There is no need for sub-partitioning on all columns of the primary key, unless partitions will be switched at that level.

The number of partitions depends on how big you want the db file for each partition to be. The number of sub-partitions depends on the number of cores, so that each core can search its own partition; N-1 sub-partitions should be OK, so that 1 core can do the overall coordination work.

Optimisation

Inserts:

Inserts are faster on a table without indexes, so insert a big chunk of data (do your updates), then create the indexes (if possible).
Change TEXT to VARCHAR: it takes some strain off the db engine.
Minimal logging and table locks may help, but are not often possible to use.

Selects:

Changing TEXT to VARCHAR should definitely improve things.
Have a current table with recent data (the last 15 days), then move it to the archive via partition switching. Here you have the option of partitioning the current table differently from the archive table (e.g. by date first, then entity_id), and of changing the partitioning scheme by moving a small amount (1 day) of data to a temp table and repartitioning it.

You can also consider partitioning by date, since you have many queries on date ranges. Consider how your data and its parts are used first, and then decide which schema will support that best.
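
To make the date-based option concrete, here is a minimal sketch of a range-partitioned variant of the question's table. The table name entitychange_by_date, the VARCHAR(64) length, and the monthly boundaries are assumptions. It uses the three-column primary key suggested earlier, which means two changes to the same field of the same entity within one second would collide.

```sql
-- Sketch only: table name, value length and partition boundaries are assumptions.
CREATE TABLE entitychange_by_date (
  entity_id  INT UNSIGNED NOT NULL,
  entitytype ENUM('STRING_1','STRING_2','SOMEBOOL','SOMEDOUBLE','SOMETIMESTAMP') NOT NULL,
  `when`     TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `value`    VARCHAR(64),
  -- MySQL requires every column used in the partitioning expression
  -- to be part of each unique key, including the primary key.
  PRIMARY KEY (entity_id, entitytype, `when`)
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(`when`)) (
  PARTITION p2013_05 VALUES LESS THAN (UNIX_TIMESTAMP('2013-06-01 00:00:00')),
  PARTITION p2013_06 VALUES LESS THAN (UNIX_TIMESTAMP('2013-07-01 00:00:00')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- The 15-day query from the question; with a range condition on `when`
-- the optimizer can skip partitions that hold only older data.
SELECT `when`, `value`
FROM entitychange_by_date
WHERE entity_id = 123
  AND entitytype = 'SOMEDOUBLE'
  AND `when` >= NOW() - INTERVAL 15 DAY;

-- Rolling window: discard a whole month in one metadata operation.
ALTER TABLE entitychange_by_date DROP PARTITION p2013_05;
```

Whether pruning actually happens for a given query is worth checking with EXPLAIN on the server version in use.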

And as for your 3rd question, I do not see how use of MongoDB will specifically benefit this situation.

Answer 4:

This is called a temporal database, and researchers have been struggling with the best way to store and query temporal data for over 20 years.

Trying to store the EAV data as you are doing is inefficient, in that storing numeric data in a TEXT column uses a lot of space, and your table is getting longer and longer, as you have discovered.

Another option which is sometimes called Sixth Normal Form (although there are multiple unrelated definitions for 6NF), is to store an extra table to store revisions for each column you want to be tracked temporally. This is similar to the solution posed by @xtrm's answer, but it doesn't need to store redundant copies of columns that haven't changed. But it does lead to an explosion in the number of tables.

I've started to read about Anchor Modeling, which promises to handle temporal changes of both structure and content. But I don't understand it well enough to explain it yet. I'll just link to it and maybe it'll make sense to you.

Here are a couple of books that contain discussions of temporal databases:

Joe Celko's SQL for Smarties, 4th ed.
Temporal Data & the Relational Model, C.J. Date, Hugh Darwen, Nikos Lorentzos

Answer 5:

Storing an integer in a TEXT column is a no-go! TEXT is the most expensive type.

I would go as far as creating one log table per field you want to monitor:

CREATE TABLE entitychange_somestring (
    entity_id INT NOT NULL, -- not a primary key on its own: one entity has many change rows
    ts TIMESTAMP NOT NULL,
    newvalue VARCHAR(50) NOT NULL, -- same type as entity.somestring
    KEY(entity_id, ts)
) ENGINE=MyISAM;

Partition them, indeed.

Notice I recommend using the MyISAM engine. You do not need transactions for this (these) unconstrained, insert-only table(s).

Answer 6:

Why is INSERTing so slow, and what can you do to make it faster?

These are the things I would look at (and roughly in the order I would work through them):

Creating a new AUTO_INCREMENT-id and inserting it into the primary key requires a lock (there is a special AUTO-INC lock in InnoDB, which is held until the statement finishes, effectively acting as a table lock in your scenario). This is not usually a problem as this is a relatively fast operation, but on the other hand, with a (Unix) load value of 10 to 15, you are likely to have processes waiting for that lock to be freed. From the information you supply, I don't see any use in your surrogate key 'id'. See if dropping that column changes performance significantly. (BTW, there is no rule that a table needs a primary key. If you don't have one, that's fine)
InnoDB can be relatively expensive for INSERTs. This is a trade off made to allow additional functionality such as transactions and may or may not be affecting you. Since all your actions are atomic, I see no need for transactions. That said, give MyISAM a try. Note: MyISAM is usually a bad choice for huge tables because it only supports table locking and not record level locking, but it does support concurrent inserts, so it might be a choice here (especially if you do drop the primary key, see above)
You could play with database storage engine parameters. Both InnoDB and MyISAM have options you could change. Some of them have an impact on how TEXT data is actually stored, others have a broader function. One you should specifically look at is innodb_flush_log_at_trx_commit.
TEXT columns are relatively expensive if (and only if) they have non-NULL values. You are currently storing all values in that TEXT column. It is worth giving the following a try: add extra fields value_int and value_double to your table and store those values in the corresponding column. Yes, that will waste some extra space, but it might be faster, although this will largely depend on the database storage engine and its settings. Please note that a lot of what people think about TEXT column performance is not true. (See my answer to a related question on VARCHAR vs TEXT)
You suggested spreading the information over more than one table. This is only a good idea if your tables are fully independent of one another. Otherwise you'll end up with more than one INSERT operation for any change, and you're more than likely to make things a lot worse. While normalizing data is usually good(tm), it is likely to hurt performance here.
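
Two of the suggestions above can be sketched directly against the question's entitychange table. The column names value_int and value_double come from the list above; everything else is an assumption, and on a 200-million-row table the ALTER itself will take a long time, so it belongs on a test copy first.

```sql
-- Typed value columns next to the existing TEXT column.
ALTER TABLE entitychange
  ADD COLUMN value_int    INT    NULL,
  ADD COLUMN value_double DOUBLE NULL;

-- A numeric change goes into the typed column; the TEXT column stays NULL.
INSERT INTO entitychange (entity_id, entitytype, value_double)
VALUES (123, 'SOMEDOUBLE', 2);

-- Write the InnoDB log at each commit but flush it to disk only about once
-- per second. Trade-off: roughly the last second of committed rows can be
-- lost if the operating system crashes or power fails.
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```

The SET GLOBAL statement needs administrative privileges and does not survive a server restart unless the setting is also placed in the configuration file.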

What can you do to make SELECTs run fast

Proper keys. And proper keys. And just in case I forgot to mention: proper keys. You don't specify in detail what your selects look like, but I assume them to be similar to "SELECT * FROM entitychange WHERE entity_id=123 AND ts>...". A single compound index on entity_id and ts should be enough to make this operation fast. Since the index has to be updated with every INSERT, it may be worth comparing the performance of (entity_id, ts) and (ts, entity_id): it might make a difference.
Partitioning. I wouldn't even bring this subject up if you hadn't asked in your question. You don't say why you'd like to partition the table. Performance-wise it usually makes no difference, provided that you have proper keys. There are some specific setups that can boost performance, but you'll need the proper hardware setup to go along with this. If you do decide to partition your table, consider doing that by either the entity_id or the TIMESTAMP column. Using the timestamp, you could end up with an archiving system in which older data is put on an archive drive. Such a partitioning system would, however, require some maintenance (adding partitions over time).
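
As a sketch of the "proper keys" point, using the column names from the question's table (where the timestamp column is called `when` rather than ts): the index name is an assumption, and including entitytype in the index is a guess based on the query described in the question.

```sql
-- Compound index matching the question's lookup pattern.
ALTER TABLE entitychange
  ADD INDEX idx_entity_type_when (entity_id, entitytype, `when`);

-- Changes to SOMEDOUBLE for entity 123 within the last 15 days.
SELECT `when`, `value`
FROM entitychange
WHERE entity_id = 123
  AND entitytype = 'SOMEDOUBLE'
  AND `when` >= NOW() - INTERVAL 15 DAY
ORDER BY `when` DESC;

-- Check that the index is actually chosen.
EXPLAIN
SELECT `when`, `value`
FROM entitychange
WHERE entity_id = 123
  AND entitytype = 'SOMEDOUBLE'
  AND `when` >= NOW() - INTERVAL 15 DAY;
```

The two equality columns come first and the range column last, so the matching rows sit in one contiguous slice of the index.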

It seems to me that you're not as concerned about query performance as about the raw insert speed, so I won't go into more detail on SELECT performance. If this does interest you, please provide more detail.

Answer 7:

I would advise you to do a lot of in-depth testing, but in my tests I am achieving very good results with both INSERT and SELECT using the table definition I posted before. I will describe my tests in detail so anyone can easily repeat them and check whether they get better results. Back up your data before any test.
I must say that these are only tests and may not reflect or improve your real case, but it's a good way of learning and probably a way of finding useful information and results.

The advice we have seen here is really good, and you will surely notice a great speed improvement by using a VARCHAR with a predefined size instead of TEXT. Although you could gain speed with it, I would advise against MyISAM for data-integrity reasons; stay with InnoDB.

TESTING:

1. Set up the table and INSERT 200 million rows:

CREATE TABLE `entity_versionable` (
  `version` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
  `fk_entity` INT(10) UNSIGNED NOT NULL,
  `str1` VARCHAR(255) DEFAULT NULL,
  `str2` VARCHAR(255) DEFAULT NULL,
  `bool1` TINYINT(1) DEFAULT NULL,
  `double1` DOUBLE DEFAULT NULL,
  `date` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`version`,`fk_entity`)
) ENGINE=INNODB AUTO_INCREMENT=230297534 DEFAULT CHARSET=latin1;

To insert 200+ million rows into a table in about 35 minutes, please check my other question, where peterm has answered with one of the best ways to fill a table. It works perfectly.

Execute the following query twice to insert 200 million rows of non-random data (the query generates 100 million rows per run; change the values each time to insert varied data):

INSERT INTO `entity_versionable` (fk_entity, str1, str2, bool1, double1, DATE)
SELECT 1, 'a1', 238, 2, 524627, '2013-06-16 14:42:25'
FROM
(
    SELECT a.N + b.N * 10 + c.N * 100 + d.N * 1000 + e.N * 10000 + f.N * 100000 + g.N * 1000000 + h.N * 10000000 + 1 N FROM 
     (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
    ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
    ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
    ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) d
    ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) e
    ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) f
    ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) g
    ,(SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) h
) t;

*Since you already have the original table with 200 million rows of real data, you probably won't need to fill it; just export your table data and schema and import them into a new testing table with the same schema. That way you run your tests in a new table with your real data, and the improvements you get will also work for the original one.

2. ALTER the new test table for performance (or use my example above in step 1 to get better results). Once the new test table is set up and filled with data, we should check the advice above and ALTER the table to speed it up:

Change TEXT to VARCHAR(255).
Choose a good unique primary key index with two or three columns. Test with the auto-increment version and fk_entity in your first test.
Partition your table if necessary, and check whether it improves speed. I would advise not partitioning it in your first tests, so that you can measure the real performance gain from changing data types and the MySQL configuration. Check the following link for some partitioning and improvement tips.
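Optimize and repair your table. The indexes will be rebuilt, which speeds up searches a lot. (Note that REPAIR TABLE applies only to engines such as MyISAM; for an InnoDB table, OPTIMIZE TABLE alone rebuilds the table and its indexes.)

OPTIMIZE TABLE test.entity_versionable;
REPAIR TABLE test.entity_versionable;
*Write a script that runs the optimization and keeps your indexes up to date, and launch it every night.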

3. Improve your MySQL and hardware configuration by carefully reading the following threads. They are worth reading and I'm sure you will get better results.

Easily improve your database hard disk configuration by spending a bit of money: if possible, use an SSD for your main MySQL database and a standalone mechanical hard disk for backups. Have the MySQL logs saved on a third hard disk to improve the speed of your INSERTs. (Remember to defragment mechanical hard disks every few weeks.)
Performance links: general&multiple-cores, configuration, optimizing IO, Debiancores, best configuration, config 48gb ram.
Profiling a SQL query: How to profile a query, Check for possible bottleneck in a query
MySQL is very memory intensive; use low-latency CL7 DDR3 memory if possible. A bit off topic, but if your system data is critical, you may want ECC memory, although it's expensive.

4. Finally, test your INSERTs and SELECTs in the test table. In my tests with 200+ million rows of random data and the above table schema, it takes 0.001 seconds to INSERT a new row and about 2 minutes to search and SELECT 100 million rows. It's only a test, but these seem to be good results :)

5. My System Configuration:

Database: MySQL 5.6.10 InnoDB database (test).
Processor: AMD Phenom II 1090T X6 core, 3910Mhz each core.
RAM: 16GB DDR3 1600Mhz CL8.
HD: Windows 7 64bits SP1 in SSD, mySQL installed in SSD, logs written in mechanical hard disk.
We would probably get better results with one of the latest Intel i5 or i7 CPUs, easily overclocked to 4500MHz+, since MySQL only uses one core per SQL statement. The higher the core speed, the faster the statement will be executed.

6. Read more about MySQL:
O'Reilly High Performance MySQL
MySQL Optimizing SQL Statements

7. Using another database: MongoDB or Redis will be perfect for this case and probably a lot faster than MySQL. Both are very easy to learn, and both have their advantages:
- MongoDB: MongoDB log file growth
- Redis

I would definitely go for Redis. If you learn how to save the log in Redis, it will be the best way to manage the log at insanely high speed: redis for logging
Keep the following advice in mind if you use Redis:

Redis is written in C and stores its data in memory. It has several methods for automatically saving the data to disk (persistence), so you probably won't have to worry about that (in a disaster scenario you will end up losing about 1 second of logging).
Redis is used by a lot of sites that manage terabytes of data; there are many ways to handle that insane amount of information, which means that it's secure (it is used here at stackoverflow, blizzard, twitter, youporn...).
Since your log will be very big, it will need to fit in memory to be fast without having to access the hard disk. You may save different logs for different dates and keep only some of them in memory. If you reach the memory limit, you won't get any errors and everything will still work perfectly, but check the Redis FAQ for more information.
I'm totally sure that Redis will be a lot faster than MySQL for this purpose. You will need to learn how to work with lists and sets to update data and to query/search for data. If you need really advanced queries, you should go with MongoDB, but for this case of simple date searches Redis will be perfect.

Nice Redis article in Instagram Blog.

Answer 8:

At work we have log tables for almost every table due to customer requirements (financial sector).

We have done it this way: two tables (the "normal" table and a log table), plus triggers on insert/update/delete of the normal table which store a keyword (I, U, D) and the old record (on update and delete) or the new one (on insert) in the log table.

We have both tables in the same database schema.
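
A minimal sketch of that trigger setup in MySQL might look as follows. The tables entity_demo and entity_demo_log, their columns, and the trigger names are hypothetical and exist only to keep the example self-contained; a real log table would mirror every column of the table it tracks.

```sql
-- Hypothetical "normal" table and its log table.
CREATE TABLE entity_demo (
  entity_id  INT UNSIGNED NOT NULL PRIMARY KEY,
  somedouble DOUBLE
) ENGINE=InnoDB;

CREATE TABLE entity_demo_log (
  log_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  action     CHAR(1) NOT NULL,   -- 'I', 'U' or 'D'
  logged_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  entity_id  INT UNSIGNED NOT NULL,
  somedouble DOUBLE
) ENGINE=InnoDB;

-- Insert: log the new record.
CREATE TRIGGER entity_demo_ai AFTER INSERT ON entity_demo
FOR EACH ROW
  INSERT INTO entity_demo_log (action, entity_id, somedouble)
  VALUES ('I', NEW.entity_id, NEW.somedouble);

-- Update: log the old record.
CREATE TRIGGER entity_demo_au AFTER UPDATE ON entity_demo
FOR EACH ROW
  INSERT INTO entity_demo_log (action, entity_id, somedouble)
  VALUES ('U', OLD.entity_id, OLD.somedouble);

-- Delete: log the old record.
CREATE TRIGGER entity_demo_ad AFTER DELETE ON entity_demo
FOR EACH ROW
  INSERT INTO entity_demo_log (action, entity_id, somedouble)
  VALUES ('D', OLD.entity_id, OLD.somedouble);

-- Usage: three statements produce three log rows.
INSERT INTO entity_demo (entity_id, somedouble) VALUES (123, 3);
UPDATE entity_demo SET somedouble = 2 WHERE entity_id = 123;
DELETE FROM entity_demo WHERE entity_id = 123;

-- Rows in order: ('I', 3), ('U', 3), ('D', 2)
SELECT action, somedouble FROM entity_demo_log ORDER BY log_id;
```

Each trigger body is a single statement, so no BEGIN ... END block or client-side DELIMITER change is needed.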