replace garbage characters within mysql

2019-08-13 19:02发布

问题:

My db is in latin1 and is full of â" or '��"' (depending on whether my terminal is set to latin1 or unicode, respectively). From context, I think they should be emdashes. They appear to be causing nasty bugs when rendered (or not rendered) in IE. I'd like to find and replace them. The problem is that neither the â nor � character match with replace. Running the query:

    update TABLE set COLUMN = replace(COLUMN,'��"','---');

Executes without error but doesn't do anything (0 rows changed). It's clear to me that the "question mark in the diamond" character is not being matched when I copy it in the terminal. Is there a way to find out its code and match it by that or something? The mysql console is tantalizingly close to being able to do this in one line so I'd rather not script it outside the terminal if I can avoid it.

The db is hosted Amazon RDS so I can't install the regexp udf that I've seen referenced in other questions here. In the long term, I'm going to have to properly convert the whole db to utf8 but I need to fix this rendering problem right away.

EDIT:

I've isolated the bad character with hexdump, it's e2 80 (I don't think this corresponds to any unicode character). How can I feed that to the replace function?

    update TABLE set COLUMN = replace(COLUMN, char(0xe2,0x80),'---');

does not do anything.

回答1:

I figured it out. I used mysql's builtin hex function to dump an entry that I knew was bad.

    select hex(column) from table where id=666;

Then picked out the words (those numbers sandwiched between "20"s) and discovered that my offending set of bytes was in fact x'C3A2E282AC2671756F743B'. How this corresponds to the way I saw it encoded in PHP and by my system (as e2 80) I don't know and at this point, I don't really care.

To verify, before destroying the data, you plug that back in to mysql:

    select x'C3A2E282AC2671756F743B';
    +---------------------------+
    | x'C3A2E282AC2671756F743B' |
    +---------------------------+
    | â€"               |
    +---------------------------+
    1 row in set (0.00 sec)

So, using the replace query like above, I was able to get rid of all the bad data at once.

For the record it was:

    update TABLE set COLUMN = replace(COLUMN, x'C3A2E282AC2671756F743B','--');

I really hope this is useful for someone. Though encoding snafus appear to be pretty common in mysql, I searched everywhere and I couldn't find an explanation for this ultimately rather simple process.



回答2:

Something might have submitted the em dash (U+2014) with UTF-8 encoding (E2 80 94), which you are now trying to interpret as latin-1, however this is not valid in latin-1 which excludes the range 80-9F so you probably see Unicode replacement characters for the illegal bytes instead, thus resulting in � when you display it. By comparison in Windows-1252 it would display as —.

You may be able to use CHAR(0xE2, 0x80, 0x94) to create the search string.