Storing base64 encoded data as BLOB or TEXT dataty

2019-01-18 03:56发布

问题:

We have a MySQL InnoDB table holding ~10 columns of small base64 encoded javascript files and png (<2KB size) images base64 encoded as well.

There are few inserts and a lot of reads comparatively, however the output is being cached on a Memcached instance for some minutes to avoid subsequent reads.

As it is right now we are using BLOB for those columns, but I am wondering if there is an advantage in switching to TEXT datatype in terms of performance or snapshot backing up.

My search digging indicates that BLOB and TEXT for my case are close to identical and since I do not know before-hand what type of data are actually going to be stored I went for BLOB.

Do you have any pointers on the TEXT vs BLOB debate for this specific case?

回答1:

One shouldn't store Base64-encoded data in one's database...

Base64 is a means of representing arbitrary binary data using only printable text characters: it was designed for situations where one needs to transfer such binary data across a protocol or medium that can handle only printable-text (e.g. SMTP/email). It increases the data size (by 33%) and adds the computational cost of encoding/decoding, so it should be avoided unless absolutely necessary.

By contrast, the whole point of BLOB columns is that they store raw binary strings. So just go ahead and store your stuff directly into your BLOB columns without first Base64-encoding them. Usually you'll want to store related metadata in other columns, such as file version/last modified date, media type, and (in the case of text files, such as JavaScript sources) character encoding. You might decide to use TEXT type columns for the text files, not only so that MySQL will natively track character encoding for you, but also so that it can transcode to alternative character sets and/or inspect/manipulate the text as may be required (now or in the future).

The (erroneous) idea that SQL databases require printable-text encodings like Base64 for handling arbitrary binary data has been perpetuated by a large number of ill-informed tutorials. This idea appears to be seated in the mistaken belief that, because SQL comprises only printable-text in other contexts, it must surely require it for binary data too (at least for data transfer, if not for data storage). This is simply not true: SQL can convey binary data in a number of ways, including plain string literals (provided that they are properly quoted and escaped like any other string); of course, the preferred way to pass data (of any type) to your database is through parameterised queries, and parameters can just as easily contain binary data as they can anything else.

For what it's worth, I usually altogether avoid storing items like this in the RDBMS and prefer instead to use those highly optimised file storage databases known as filesystems—but that's another matter altogether.

...unless it's cached for performance reasons...

The only situation in which there might be some benefit from storing Base64-encoded data is where data is frequently retrieved from the database and transmitted across a protocol that requires that encoding—in which case, storing the Base64-encoded representation would save from having to perform the encoding operation on the otherwise raw data upon every fetch.

However, note in this sense that the Base64-encoded storage is merely acting as a cache, much like one might store denormalised data for performance reasons.

...in which case it should be TEXT not BLOB

As alluded to above, the difference between TEXT and BLOB really comes down to the fact that TEXT columns are stored together with text-specific metadata (such as character encoding and collation), whereas BLOB columns are not. This additional metadata enables MySQL to transcode characters between storage and connection character sets (where appropriate) and perform fancy character equivalence/ordering.

Generally speaking: if two clients working in different character sets should see the same bytes, then you want a BLOB column; if they should see the same characters then you want a TEXT column.

With Base64, those two clients must ultimately find that the data decodes to the same bytes; but they should see that the encoded data has the same characters. For example, suppose one wishes to insert the Base64-encoding of 'Hello world!' (which is 'SGVsbG8gd29ybGQh'). If the inserting application is working in the UTF-8 character set, then it will send the byte sequence 0x53475673624738676432397962475168 to the database.

  • if that byte sequence is stored in a BLOB column and later retrieved by an application that is working in UTF-16, the same bytes will be returned—which represent '升噳扇㡧搲㥹扇全' and not the desired Base64-encoded value; whereas

  • if that byte sequence is stored in a TEXT column and later retrieved by an application that is working in UTF-16, MySQL will transcode on-the-fly to return the byte sequence 0x0053004700560073006200470038006700640032003900790062004700510068—which represents the original Base64-encoded value 'SGVsbG8gd29ybGQh' as desired.

Of course, you could nevertheless use BLOB columns and track the character encoding in some other way—but that would just needlessly reinvent the wheel, with added maintenance complexity and risk of introducing unintentional errors.