How can I check that the trademark(™) character is

2019-07-15 15:50发布

问题:

How can I check that the trademark(™) character is set correctly in my Oracle database?

I expect it to be stored using UTF-8 encoding.

I have a value stored in a Salesforce.com field that looks like this from the GUI (notice the trademark character):

Chuck Norris's Roundhouse Kick™

I'm using Informatica to replicate it to an Oracle database. My database is set to use the AL32UTF8 encoding.

How it shows up in SQL Developer

When I query my table using SQL Developer, the trademark symbol shows up as a rectangle (black border, white fill).

How it shows up in HTML

When I export it from SQL Developer using the UTF-8 encoding into an HTML document, and open it in Chrome, the trademark symbol does not appear at all. When I open it in IE, the it appears as a rectangle again. In Firefox, it's a rectangle with 00 in the top half and 99 in the bottom half. All three browsers interpret the HTML doc using UTF-8.

How it shows up in text editors

Opening the same HTML doc in Notepad and Notepad++, the trademark symbol shows up as a rectangle. If I use the Hex Viewer plugin for Notepad++ I see the byte encoding is C2 99. That seems to be the correct encoding for the trademark symbol in UTF-8.

When I open document in MS Write, the trademark character looks like this: ™.

When I get the value programmatically

Using Python, when I get the value from the database, the trademark character is replaced with '\xbf' -- the inverted question mark, but that character is not even properly encoded as far as I can tell because it's missing at least one leading byte (depending on the specific encoding)

>>> import cx_Oracle
>>> con = cx_Oracle.connect('username', 'password', 'db')
>>> cur = con.cursor()
>>> cur.execute('select * from trademark')
<__builtin__.OracleCursor on <cx_Oracle.Connection to username@db>>
>>> records = cur.fetchall()
>>> records[0][0]
"Chuck Norris's Roundhouse Kick\xbf"

Ideally, I'd like to be able to validate the data stored in my Oracle database using all of the above methods. I'd settle for someone just validating that what I saw in the Hex Viewer was enough of a "test" ;)

回答1:

The character literal ™ you posted is not U+0099 (a control character), but U+2122 (TRADE MARK SIGN).

The Unicode spec defines U+0099 as follows:

0099;<control>;Cc;0;BN;;;;;N;;;;;

So, it doesn't even have a name and I haven't gone digging round the spec to find out what this character is for.

Decoding U+0099 in Windows does result in a trademark grapheme. I guess this is a bug.

The correct byte sequence for the TRADE MARK SIGN (U+2122) in UTF-8 is E2 84 A2.



回答2:

Just for any future reference, because the author did not bother posting a fix. It's indeed an Informatica problem, what is needed:

  1. Change connection properties in odbc.ini on Informatica box, add "IANAAppCodePage=106" to the connections that need UTF8.
  2. Change connection properties in Informatica itself for the connection and add "Codepage=Utf-8" in "Connection Manager -> Connections -> Relational -> -> Edit"


回答3:

if you are saving this string for output in an html doc use: &trade; the html entity for the trademark symbol.

if you are using this string for non html purposes decode the string at runtime use:

import HTMLParser
h = HTMLParser.HTMLParser()
s = h.unescape('&trade;')

see: http://www.w3schools.com/html/html_entities.asp http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/