How can I check that the trademark(™) character is set correctly in my Oracle database?
I expect it to be stored using UTF-8 encoding.
I have a value stored in a Salesforce.com field that looks like this from the GUI (notice the trademark character):
Chuck Norris's Roundhouse Kick™
I'm using Informatica to replicate it to an Oracle database. My database is set to use the AL32UTF8 encoding.
How it shows up in SQL Developer
When I query my table using SQL Developer, the trademark symbol shows up as a rectangle (black border, white fill).
How it shows up in HTML
When I export it from SQL Developer using the UTF-8 encoding into an HTML document, and open it in Chrome, the trademark symbol does not appear at all. When I open it in IE, the it appears as a rectangle again. In Firefox, it's a rectangle with 00
in the top half and 99
in the bottom half. All three browsers interpret the HTML doc using UTF-8.
How it shows up in text editors
Opening the same HTML doc in Notepad and Notepad++, the trademark symbol shows up as a rectangle. If I use the Hex Viewer plugin for Notepad++ I see the byte encoding is C2 99
. That seems to be the correct encoding for the trademark symbol in UTF-8.
When I open document in MS Write, the trademark character looks like this: ™
.
When I get the value programmatically
Using Python, when I get the value from the database, the trademark character is replaced with '\xbf' -- the inverted question mark, but that character is not even properly encoded as far as I can tell because it's missing at least one leading byte (depending on the specific encoding)
>>> import cx_Oracle
>>> con = cx_Oracle.connect('username', 'password', 'db')
>>> cur = con.cursor()
>>> cur.execute('select * from trademark')
<__builtin__.OracleCursor on <cx_Oracle.Connection to username@db>>
>>> records = cur.fetchall()
>>> records[0][0]
"Chuck Norris's Roundhouse Kick\xbf"
Ideally, I'd like to be able to validate the data stored in my Oracle database using all of the above methods. I'd settle for someone just validating that what I saw in the Hex Viewer was enough of a "test" ;)
Just for any future reference, because the author did not bother posting a fix. It's indeed an Informatica problem, what is needed:
The character literal ™ you posted is not U+0099 (a control character), but U+2122 (TRADE MARK SIGN).
The Unicode spec defines U+0099 as follows:
So, it doesn't even have a name and I haven't gone digging round the spec to find out what this character is for.
Decoding U+0099 in Windows does result in a trademark grapheme. I guess this is a bug.
The correct byte sequence for the TRADE MARK SIGN (U+2122) in UTF-8 is
E2 84 A2
.if you are saving this string for output in an html doc use:
™
the html entity for the trademark symbol.if you are using this string for non html purposes decode the string at runtime use:
see: http://www.w3schools.com/html/html_entities.asp http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/