BLOB vs. VARCHAR for storing arrays in a MySQL tab

2019-01-17 21:33发布

问题:

I've got a design decision to make and am looking for some best practice advice. I have a java program which needs to store a large number (few hundred a day) of floating point arrays in a MySQL database. The data is a fixed length Double array of length 300. I can see three reasonable options:

  1. Store the data as a BLOB.
  2. Serialize the data and store it as a VARCHAR.
  3. Write the data to disk as a binary file and store a reference to it instead.

I should also mention that this data will be read from and updated frequently.

I want to use a BLOB since that is what I have done in the past and it seems like the most efficient method (e.g., maintains fixed width & no need to convert to a comma separated string). However my coworker is insisting that we should serialize and use varchar for reasons which seem mostly dogmatic.

If one of these methods is better than the other, are the reasons Java or MySQL specific?

回答1:

Store as a BLOB like so (see code example below). I think this is probably better than using java serialization since java's builtin serialization will need 2427 bytes, and non-java applications will have a harder time dealing with the data. That is, should there ever be any non-java applications querying the database in the future.... if not then the builtin serialization is a few less lines.

public static void storeInDB() throws IOException, SQLException {

    double[] dubs = new double[300];

    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    DataOutputStream dout = new DataOutputStream(bout);
    for (double d : dubs) {
        dout.writeDouble(d);
    }
    dout.close();
    byte[] asBytes = bout.toByteArray();

    PreparedStatement stmt = null;  // however we normally get this...
    stmt.setBytes(1, asBytes);

}

public static double[] readFromDB() throws IOException, SQLException {

    ResultSet rs = null;  // however we normally get this...
    while (rs.next()) {
        double[] dubs = new double[300];
        byte[] asBytes = rs.getBytes("myDoubles");
        ByteArrayInputStream bin = new ByteArrayInputStream(asBytes);
        DataInputStream din = new DataInputStream(bin);
        for (int i = 0; i < dubs.length; i++) {
            dubs[i] = din.readDouble();
        }
        return dubs;
    }

}

Edit: I'd hoped to use BINARY(2400), but MySQL says:

mysql> create table t (a binary(2400)) ;
ERROR 1074 (42000): Column length too big for column 'a' (max = 255);
use BLOB or TEXT instead


回答2:

Is there a reason you don't create a child table so you can store one floating point value per row, instead of an array?

Say you store a thousand arrays of 300 elements each per day. That's 300,000 rows per day, or 109.5 million per year. Nothing to sneeze at, but within the capabilities of MySQL or any other RDBMS.


Re your comments:

Sure, if the order is significant you add another column for the order. Here's how I'd design the table:

CREATE TABLE VectorData (
  trial_id INT NOT NULL,
  vector_no SMALLINT UNSIGNED NOT NULL,
  order_no SMALLINT UNSIGNED NOT NULL,
  element FLOAT NOT NULL,
  PRIMARY KEY (trial_id, vector_no),
  FOREIGN KEY (trial_id) REFERENCES Trials (trial_id)
);
  • Total space for a row of vector data: 300x(4+2+2+4) = 3600 bytes. Plus InnoDB record directory (internals stuff) of 16 bytes.

  • Total space if you serialize a Java array of 300 floats = 1227 bytes?

So you save about 2400 bytes, or 67% of the space by storing the array. But suppose you have 100GB of space to store the database. Storing a serialized array allows you to store 87.5 million vectors, whereas the normalized design only allows you to store 29.8 million vectors.

You said you store a few hundred vectors per day, so you'll fill up that 100GB partition in only 81 years instead of 239 years.


Re your comment: Performance of INSERT is an important issue, but you're only storing a few hundred vectors per day.

Most MySQL applications can achieve hundreds or thousands of inserts per second without excessive wizardry.

If you need optimal performance, here are some things to look into:

  • Explicit transactions
  • Multi-row INSERT syntax
  • INSERT DELAYED (if you still use MyISAM)
  • LOAD DATA INFILE
  • ALTER TABLE DISABLE KEYS, do the inserts, ALTER TABLE ENABLE KEYS

Search for the phrase "mysql inserts per second" on your favorite search engine to read many articles and blogs talking about this.



回答3:

If you just want to store the data as a binary dump of the Java array then, by all means, use a BLOB. Your friend may well be advising against this since you may want some non-Java program to use the information at some later date so binary dumps are probably a pain to interpret.

With serialization to a VARCHAR, you know the data format and can easily read it with any application.

Of course, if there's even the slightest chance that you'll want to manipulate or report on the individual floats, they should be stored in a database-friendly format. In other words, not a binary dump, not serialized, not a CSV column.

Store them as Codd intended, in third normal form.

By the way, a few hundred 300-element floating point arrays each day is not a big database. Take it from someone who works on the mainframe with DB2, most DBMS' will easily handle that sort of volume. We collect tens of millions of rows every day into our application and it doesn't even break into a sweat.



回答4:

Using a database to store an one dimensional array is pain in the ass! Even more using a rdm where is no relation between the data stored. sorry but the best solution imho is use a file and just write the data the way u like. binary or as txt. Thus 300xsize of long or 300x1 line of txt is one array.