mysql-python collation issue: how to force unicode

Published 2019-07-13 12:03

Question:

For certain purposes I had to change field collations from utf8_unicode_ci to utf8_bin in a database. It turned out that the change led to a change in the datatypes returned to Python.

The question is how to force mysql-python to return unicode objects to Python.

Here is a sample that shows the problem (explicit charset forces use_unicode=1):

>>> con = MySQLdb.connect(..., charset='utf8')
>>> c = con.cursor()
>>> c.execute('SELECT %s COLLATE utf8_bin', (u'м',))
1L
>>> c.fetchone()
('\xd0\xbc',)
>>> c.description
(("'\xd0\xbc' COLLATE utf8_bin", 253, 2, 3, 3, 31, 0),)


>>> c.execute('SELECT %s COLLATE utf8_unicode_ci', (u'м',))
1L
>>> c.fetchone()
(u'\u043c',)
>>> c.description
(("'\xd0\xbc' COLLATE utf8_unicode_ci", 253, 2, 3, 3, 31, 0),)

In my database the fields are of type VARCHAR, but after the change they behave like BINARY which is not what I want.

Answer 1:

It turns out that the problem is rather awkward. In short, most varieties of MySQL string datatypes map to a single datatype in MySQL's client interface, distinguished only by an additional BINARY flag.

Thus MySQL's VARCHAR, VARBINARY, and string literals all map to the same MySQLdb.constants.FIELD_TYPE.VAR_STRING type in column type definitions, but carry an additional MySQLdb.constants.FLAG.BINARY flag when the type is VARBINARY or a string collated with a *_bin collation.

Even though there is a MySQLdb.constants.FIELD_TYPE.VARCHAR type, I failed to find out when it is used. As I said, MySQL VARCHAR columns map to FIELD_TYPE.VAR_STRING.
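The converter entries shown later in this answer are lists of (flag mask, converter) pairs, and that is how MySQLdb dispatches on the BINARY flag. Here is a rough pure-Python sketch of that dispatch, to make the mechanism concrete (the function and variable names are mine; the value 128 is MySQLdb.constants.FLAG.BINARY):

```python
# -*- coding: utf-8 -*-
BINARY_FLAG = 128  # value of MySQLdb.constants.FLAG.BINARY

def convert(value, field_flags, entry):
    """Mimic MySQLdb's per-field dispatch: pick the first
    (mask, converter) pair whose mask matches the field's flags;
    a mask of None matches anything."""
    for mask, conv in entry:
        if mask is None or field_flags & mask:
            return conv(value)
    return value

# The default entry for FIELD_TYPE.VAR_STRING looks roughly like:
entry = [(BINARY_FLAG, bytes),                     # *_bin / VARBINARY -> raw bytes
         (None, lambda s: s.decode('utf-8'))]      # everything else -> unicode

# A *_bin-collated VARCHAR carries the BINARY flag, so it stays bytes:
convert(b'\xd0\xbc', BINARY_FLAG, entry)  # -> b'\xd0\xbc'
# A *_ci-collated VARCHAR does not, so it is decoded:
convert(b'\xd0\xbc', 0, entry)            # -> u'\u043c'
```

This also explains the observed behavior: both collations produce type code 253 (VAR_STRING), and only the flag differs.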

The solution below is rather fragile if your application also uses true binary strings (for example, you store images and fetch them over the same connection as text), since it decodes all binary strings to unicode. Still, it works.

As the official docs state:

Because MySQL returns all data as strings and expects you to convert it yourself. This would be a real pain in the ass, but in fact, _mysql can do this for you. (And MySQLdb does do this for you.) To have automatic type conversion done, you need to create a type converter dictionary, and pass this to connect() as the conv keyword parameter.

In practice, the real pain in the ass might be the process of constructing your own converter dictionary. But you can import the default one from MySQLdb.converters.conversions and patch it, or even patch it on a Connection instance. The trick is to remove the special converter for the FLAG.BINARY flag and add a decoder for all cases. If you explicitly pass a charset parameter to MySQLdb.connect, it forces use_unicode=1, which adds the decoder for you, but you can also do it yourself:

>>> con = MySQLdb.connect(**params)
>>> con.converter[FIELD_TYPE.VAR_STRING]
[(128, <type 'str'>), (None, <function string_decoder at 0x01FFA130>)]
>>> con.converter[FIELD_TYPE.VAR_STRING] = [(None, con.string_decoder)]
>>> c = con.cursor()
>>> c.execute("SELECT %s COLLATE utf8_bin", (u'м',))
1L
>>> c.fetchone()
(u'\u043c',)

You may need to apply the same hack to FIELD_TYPE.STRING as well.
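If several field types need the patch, a small helper keeps it in one place. This is a sketch; force_unicode is my name, the codes 253 and 254 mirror MySQLdb.constants.FIELD_TYPE.VAR_STRING and FIELD_TYPE.STRING, and decoder stands in for con.string_decoder:

```python
VAR_STRING, STRING = 253, 254  # MySQLdb.constants.FIELD_TYPE values

def force_unicode(converter, decoder, field_types=(VAR_STRING, STRING)):
    """Replace the per-type converter lists so that every value,
    BINARY-flagged or not, goes through `decoder`."""
    for ftype in field_types:
        converter[ftype] = [(None, decoder)]
    return converter

# usage sketch, where con is an open MySQLdb connection:
# force_unicode(con.converter, con.string_decoder)
```

Note that this deliberately drops the (FLAG.BINARY, str) pair, which is exactly the fragility mentioned above: genuine VARBINARY data on the same connection would also be decoded.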

Another solution is to pass an explicit use_unicode=0 to MySQLdb.connect and do all the decoding in your own code, but I would not recommend it.
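For completeness, if you did go the use_unicode=0 route, the decoding in your code would look something like this (a sketch; decode_row is my name, not a MySQLdb API):

```python
def decode_row(row, encoding='utf-8'):
    """Decode every bytestring in a fetched row, leaving
    non-string values (ints, dates, and any real binary data
    you handle separately) untouched."""
    return tuple(v.decode(encoding) if isinstance(v, bytes) else v
                 for v in row)

decode_row((b'\xd0\xbc', 1))  # -> (u'\u043c', 1)
```

You would have to call this on every fetched row, which is why patching the converter dictionary once per connection is the nicer option.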

Hope this is useful to someone.



Answer 2:

It's a large change from using mysql-python at the low level, but I think a better idea is to use something like sqlalchemy instead of the DB-API directly. Then you could use e.g. types.Unicode and know that it does whatever is required for unicode support on top of the DB-API.

Before you jump on me for not answering the question directly, consider this: mysql-python, a.k.a. MySQLdb, is only one of several DB-API implementations for MySQL. MySQLdb will likely continue to be supported in new versions, but circumstances (such as moving to Python 3.x, or a host where you cannot install binary modules) may force you to use something else in the future, such as oursql or myconnpy. The people behind sqlalchemy have gone to a lot of effort to support multiple DB-API implementations and, in the case of mysql-python, have even worked around severe bugs in the past. With sqlalchemy, switching to a different DB-API is as simple as changing the connect URL, and it makes sure datatype coercion is handled as you would expect.

That said, to take advantage of this you'd need to define your tables using sqlalchemy's schema constructs and use its query API, but you'd get a lot in return.
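A minimal sketch of that approach with SQLAlchemy Core, using the Unicode column type. I use an in-memory SQLite database here purely for illustration; with MySQL you would point create_engine at a MySQL URL instead, and the table/column names are invented:

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, Unicode, select)

# SQLite stands in for MySQL here; for MySQL the URL would look
# something like "mysql://user:pass@host/db?charset=utf8".
engine = create_engine("sqlite://")
metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", Unicode(50)),  # Unicode: unicode in, unicode out
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(users.insert().values(name=u"\u043c"))

with engine.connect() as conn:
    name = conn.execute(select(users.c.name)).scalar()
# `name` comes back as a unicode string regardless of the
# backend's collation quirks.
```

The point is that the Unicode type makes the decoding contract explicit at the schema level, instead of depending on per-connection converter dictionaries.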