What will happen to existing data if I change the

2019-02-17 01:54发布

问题:

I am running a production application with MySQL database server. I forget to set column's collation from latin to utf8_unicode, which results in strange data when saving to the column with multi-language data.

My question is, what will happen with my existing data if I change my collation to utf8_unicode now? Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?

I will change with phpMyAdmin web client.

回答1:

Running a quick test in MySQL 5.1 with a VARCHAR column set to latin1_bin I inserted some non-latin chars

INSERT INTO Test VALUES ('英國華僑');

I select them and get rubbish (as expected).

SELECT text from Test;

gives

text
????

I then changed the collation of the column to utf8_unicode and re-ran the SELECT and it shows the same result

text
????

This is what I would expect - It will keep the data and the data will remain rubbish, because when the data was inserted the column lost the extra character information and just inserted a ? for each non-latin character and there is no way for the ???? to again become 英國華僑.

Your data will stay in place but it won't be fixed.



回答2:

The article http://mysqldump.azundris.com/archives/60-Handling-character-sets.html discusses this at length and also shows what will happen.

Please note that you are mixing up a CHARACTER SET (actually an encoding) with a COLLATION.

A character set defines the physical representation of a string in bytes on disk. You can make this visible, using the HEX() function, for example SELECT HEX(str) FROM t WHERE id = 1 to see how MySQL stores the bytes of your string. What MySQL delivers to you may be different, depending on the character set of your connection, defined with SET NAMES .....

A collation is a sort order. It is dependent on the character set. For example, your data may be in the latin1 character set, but it may be ordered according to either of the two german sort orders latin1_german1_ci or latin1_german2_ci. Depending on your choice, Umlauts such as ö will either sort as oe or as o.

When you are changing a character set, the data in your table needs to be rewritten. MySQL will read all data and all indexes in the table, make a hidden copy of the table which temporarily takes up disk space, then moves the old table into a hidden location, moves the hidden table into place and then drops the old data, freeing up disk space. For some time inbetween, you will need two times the storage for that.

When you are changing a collation, the sort order of the data changes but not the data itself. If the column you are changing is not part of an index, nothing needs to be done besides rewriting the frm file, and sufficiently recent versions of MySQL should not do more.

When you are changing a collation of a column that is part of an index, the index needs to be rewritten, as an index is a sorted excerpt of a table. This will again trigger the ALTER TABLE table copy logic outlined above.

MySQL tries to preserve data doing this: As long as the data you have can be represented in the target character set, the conversion will not be lossy. Warnings will be printed if there is data truncation going on, and data which cannot be represented in the target character set will be replaced by ?



回答3:

Valid data will be properly converted:

When you change a data type using CHANGE or MODIFY, MySQL tries to convert existing column values to the new type as well as possible. Warning: This conversion may result in alteration of data.

http://dev.mysql.com/doc/refman/5.5/en/alter-table.html

... and more specifically:

To convert a binary or nonbinary string column to use a particular character set, use ALTER TABLE. For successful conversion to occur, one of the following conditions must apply:[...] If the column has a nonbinary data type (CHAR, VARCHAR, TEXT), its contents should be encoded in the column character set, not some other character set. If the contents are encoded in a different character set, you can convert the column to use a binary data type first, and then to a nonbinary column with the desired character set.

http://dev.mysql.com/doc/refman/5.1/en/charset-conversion.html

So your problem is invalid data, e.g., data encoded in a different character set. I've tried the tip suggested by the documentation and it basically ruined my data, but the reason is that my data was already lost: running SELECT column, HEX(column) FROM table showed that multibyte chars had been inserted as 0x3F (i.e., the ? symbol in Latin1). My MySQL stack had been smart enough to detect that input data was not Latin1 and convert it into something "compatible". And once data is gone, you can't get it back.

To sum up:

  1. Use HEX() to find out if you still have your data.
  2. Make your tests in a copy of your table.


回答4:

My question is, what will happen with my existing data if I change my collation to utf8_unicode now?

Answer: If you change to utf8_unicode_ci, nonthing will happen to your existing data (which is already corrupt and remain corrupt till you modify it).

Will it destroy or corrupt the existing data or will the data remain, but the new data will be saved as utf8 as it should?

Answer: After you change to utf8_unicode_ci, existing data will not be destroyed. It will remain the same like before (something like ????). However, if you insert new data containing Unicode characters, it will be stored correctly.

I will change with phpMyAdmin web client.

Answer: Sure, you can change collation with phpMyAdmin by going to Operations > Table options



回答5:

CAUTION! Some problems are solved via

ALTER TABLE ... CONVERT TO ...

Some are solved via a 2-step process

ALTER TABLE ... MODIFY ... VARBINARY...
ALTER TABLE ... MODIFY ... VARCHAR...

If you do the wrong one, you will have a worse mess!

  1. Do SELECT HEX(col), col ... to see what you really have.
  2. Study this to see what case you have: Trouble with utf8 characters; what I see is not what I stored
  3. Perform the correct fix, based on these cases: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases