I am using and working on software that uses MySQL as a backend engine (it can use others, such as PostgreSQL, Oracle, or SQLite, but MySQL is the main one we use). The software was designed in such a way that the binary data we want to access is kept as BLOBs in individual columns: each table has one BLOB column, the other columns hold integers/floats that characterize the BLOB, and one string column holds the BLOB's MD5 hash. Each table typically has 2, 3, or 4 indexes, one of which is always on the MD5 column and is made UNIQUE. Some tables already have millions of entries and have grown to multiple gigabytes in size. We keep separate per-year MySQL databases on the same server (so far). The hardware is quite reasonable, I think, for general applications (a Dell PowerEdge 2U-form server).
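To make the layout concrete, a table looks roughly like this (a hypothetical sketch; the names and the exact metadata columns are illustrative, not our real schema):

CREATE TABLE entry (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    md5     CHAR(32) NOT NULL,      -- hex digest of the blob
    size    INT UNSIGNED NOT NULL,  -- metadata characterizing the blob
    score   FLOAT NOT NULL,         -- more metadata
    payload LONGBLOB NOT NULL,      -- the binary data itself
    UNIQUE KEY uq_md5 (md5),        -- the index every INSERT must check
    KEY idx_size (size)             -- one of the additional, non-unique indexes
);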
MySQL SELECT queries are relatively fast. There is little to complain about there, since these run (most of the time) in batch mode. However, INSERT queries take a long time, and the time increases with table size (number of rows). Admittedly, this is because the MD5 column is UNIQUE, so each INSERT has to check whether the new row's MD5 string matches an already-inserted one. And it is not too strange (I think) that performance gets worse when there are other, non-unique indexes. But I still can't put my mind to rest that this architectural choice is the best one: I suspect that keeping the BLOBs in the table rows, instead of on disk, has a significant negative impact. Insertions are not critical, but it is an annoying feeling to have.
Does anyone have experience with similar situations? With MySQL, or even other (preferably Linux-based) RDBMSes? Any insights you would care to provide, maybe some performance figures?
BTW, the working language is C++ (which wraps C calls to MySQL's API).
It could be time for horizontal partitioning, and for moving the BLOB field into a separate table (vertical partitioning). In this article, in the section 'A Quick Side Note on Vertical Partitioning', the author removes a large VARCHAR field from a table, and it speeds up a query by about an order of magnitude.
The reason is that physically traversing the data on disk becomes significantly faster when there is less space to cover, so moving bigger fields elsewhere improves performance.
Also (and you probably do this already), it pays to shrink your index column to its absolute minimum (CHAR(32) in ASCII encoding for an MD5 hex digest), because the smaller the key, the faster it is to use.
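A minimal sketch of that split, reusing the hypothetical table names from the question (illustrative, not a prescription):

-- Hot table: small rows, so scans and the UNIQUE md5 check touch little disk
CREATE TABLE entry (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    md5  CHAR(32) CHARACTER SET ascii NOT NULL,
    size INT UNSIGNED NOT NULL,
    UNIQUE KEY uq_md5 (md5)
);

-- Cold table: one row per blob, read only when the binary data is needed
CREATE TABLE entry_blob (
    entry_id INT UNSIGNED NOT NULL PRIMARY KEY,
    payload  LONGBLOB NOT NULL
);

-- Metadata queries never touch entry_blob; the payload is fetched separately:
-- SELECT payload FROM entry_blob WHERE entry_id = ...;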
If you do multiple inserts at a time with InnoDB tables, you can significantly increase insert speed by wrapping them in a transaction and putting multiple rows into one INSERT statement:
START TRANSACTION;
INSERT INTO x (id, md5, field1, field2) VALUES
    (1, '123dab...', 'data1', 'data2'),
    (2, 'ab2...', 'data3', 'data4'), .....;
COMMIT;
See Speed of INSERT Statements. Do you have frequent MD5 collisions? I believe these should not happen too often, so maybe you can use something like INSERT ... ON DUPLICATE KEY UPDATE to handle the collisions. If you have specific insert periods, you can disable keys for the duration of the insert and restore them afterwards. Another option is to use replication, with a master machine for the inserts and a slave for the selects.
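Sketches of both ideas, again with illustrative table/column names; note that DISABLE KEYS affects only non-unique indexes, and only on MyISAM:

-- Absorb a duplicate MD5 instead of failing the statement;
-- the no-op assignment leaves the existing row untouched
INSERT INTO entry (md5, size)
VALUES ('123dab...', 42)
ON DUPLICATE KEY UPDATE md5 = md5;

-- During a dedicated insert window:
ALTER TABLE entry DISABLE KEYS;
-- ... bulk INSERTs here ...
ALTER TABLE entry ENABLE KEYS;  -- rebuilds the disabled indexes in one pass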
Are you using MyISAM?
AFAIK MyISAM has very good read performance but poor write performance.
InnoDB should be more balanced in speed.
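For what it's worth, checking and converting the engine looks like this (table name illustrative; converting rewrites the whole table, so it is slow on multi-gigabyte data, and an engine switch should be validated carefully first):

SHOW TABLE STATUS LIKE 'entry';     -- the Engine column shows MyISAM or InnoDB
ALTER TABLE entry ENGINE = InnoDB;  -- full table rewrite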
Does your data fit in RAM? If not, get more RAM until that becomes uneconomical (16 GB is usually about the point for most people).
Then, do your indexes fit in the MyISAM key buffer?
If you're running a 32-bit OS, don't. Once you're on a 64-bit OS, set the key buffer to approximately 1/3 of the RAM; the rest of the RAM is used by the OS's cache to cache the data files (which does little for inserts but is beneficial for selects).
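A rough sketch of inspecting and sizing the key buffer (the 5 GB figure is illustrative, about a third of a 16 GB box; make it permanent by setting key_buffer_size in my.cnf):

SHOW VARIABLES LIKE 'key_buffer_size';
SHOW STATUS LIKE 'Key_blocks_unused';     -- rough head-room check
SET GLOBAL key_buffer_size = 5368709120;  -- 5 GB; resizes the key cache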
Having multi-gigabyte tables in MyISAM can be a pain, because in the event of an unclean shutdown very lengthy repair operations are required. But don't switch MySQL engines without significant validation of your application: it will change the behaviour in many ways (not just performance), and it will affect disc space usage.
I asked a somewhat-related question today as well.
One of the answers provided was to consider INSERT DELAYED, so that the rows go into the insert queue and are handled when the database is not as busy.
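For reference, a sketch of the syntax (table name illustrative). INSERT DELAYED only works with MyISAM-family engines, not InnoDB, and it was later deprecated and removed in MySQL 5.7, so check your server version:

INSERT DELAYED INTO entry (md5, size)
VALUES ('123dab...', 42);
-- the client returns immediately; the server buffers the row
-- and writes it when the table is not in use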