Why is XmlParser converting my character hex code?

Posted 2019-09-19 05:58

Question:

In my Grails application I use Groovy's XmlParser to parse an XML file. The value of one of the attributes in my XML file is a string that equals a character hex code. I want to save that string in my database:

&#xD1;

Unfortunately the attribute method returns the Ñ character and what actually gets stored in the database is c391. When the field is read back out I also get the Ñ character which is undesired.

How can I store the hex code as a string in my database and make sure it gets read back out as a hex code as well?

Update #1:

The reason this is a problem for me is that once I read the XML file into my database I must be able to reconstruct it exactly as it was. An additional problem is that the field in question isn't always a character hex code. It could just be some arbitrary string.

Update #2:

I guess it doesn't matter how the character is stored in the database, so long as I can write it back out in its expanded hex code format. I am using Groovy MarkupBuilder to reconstruct my XML file from the database and I am unclear why this isn't happening by default.

Update #3:

I overrode getTableTypeString in my custom MySQL dialect and that seems to have helped somewhat. At least now the value I pass to MySQL is the value that gets stored in the database.

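import org.hibernate.dialect.MySQL5InnoDBDialect

class CustomMySQL5InnoDBDialect extends MySQL5InnoDBDialect {
    // Appended to generated CREATE TABLE statements so new tables default to utf8
    @Override
    public String getTableTypeString() {
        return " ENGINE=InnoDB DEFAULT CHARSET=utf8"
    }
}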

I also created my own version of groovy.util.XmlParser. My version is pretty much an exact duplicate of groovy.util.XmlParser except that in the startElement method I changed:

String value = list.getValue(i)

to this:

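// Raw attribute text as it appeared in the file (Xerces internal fields)
def rawValue = list.fAttributes.fAttributes[i].nonNormalizedValue
// Keep the raw text only when the whole value is a hex character reference;
// otherwise use the parser's normalized value as before
def value = (rawValue ==~ /&#x([0-9A-F]+?);/) ? rawValue : list.getValue(i)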

This allows the exact text of hex code elements to be stored in the database.

Now there are two new problems, possibly three.

  1. Recreating a file with the exact values stored in the database. Until now I had been using MarkupBuilder, but it applies extra encoding to ampersands, causing the value &#xD1; to be written out as &amp;#xD1;. I can probably get around this by abandoning MarkupBuilder and building my XML strings manually, but I would rather not.

  2. Running an XSLT transform on an XML file using the Saxon-HE 9.4 processor causes some hex code values such as &#xFF; to be changed to the literal character ÿ, yet others, like the one that displays as ™, are left unchanged.

  3. I'm not sure if this is going to be a problem yet or not, but when I recreate the file I would like it to be in ANSI encoding since that is the encoding used for the original file.

Answer 1:

OK, so given the XML:

def xml = '''<root>
    <node woo="&#xD1;"/>
    <another attr="This is an N-Tilde - &#xD1;"/>
</root>'''

We can read that attribute into a variable:

def woo = new XmlParser().parseText( xml ).node[0].@woo

And printing it out gives us 'Ñ' (with a character value of 209).

But that's what I'd expect... as &#xD1; is the same as &#209; which is the correct encoding for N-tilde

Ahhh, so is the question "How can I read attributes, and keep them as-is without any entity resolving"?

I don't believe you can (all I've seen is negative answers from a search of the web)... What you could do is something like:

// Mask character references so the parser leaves them alone
xml = xml.replaceAll( /&#x([0-9A-F]+?);/, '!!#x$1;' )

def root = new XmlParser().parseText( xml )

// Unmask when reading the values back (the first node's attribute is "woo")
println root.node[0].@woo.replaceAll( /!!#x([0-9A-F]+?);/, '&#x$1;' )
println root.another[0].@attr.replaceAll( /!!#x([0-9A-F]+?);/, '&#x$1;' )

But as far as I know, there's no method for turning off entity resolution :-( (fingers crossed I'm wrong)
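The masking idea above can be wrapped in a pair of helpers so the round trip is easier to check. This is only a sketch: maskRefs and unmaskRefs are hypothetical names, the '!!' marker is assumed never to occur in the real data, and the pattern is widened to accept lowercase hex and decimal references as well. On Groovy 4 you would likely need `import groovy.xml.XmlParser`, since the original snippet relies on the older default import.

```groovy
// Hypothetical helpers; '!!' is an arbitrary marker assumed absent from the data
def maskRefs   = { String s -> s.replaceAll( /&#([xX][0-9A-Fa-f]+|[0-9]+);/, '!!#$1;' ) }
def unmaskRefs = { String s -> s.replaceAll( /!!#([xX][0-9A-Fa-f]+|[0-9]+);/, '&#$1;' ) }

def source = '<root><node woo="&#xD1;"/><another attr="N-Tilde - &#209;"/></root>'

// Parse the masked text, so the parser never sees a character reference
def parsed = new XmlParser().parseText( maskRefs( source ) )

// Unmask each value after reading it to recover the original text
def woo  = unmaskRefs( parsed.node[0].@woo )
def attr = unmaskRefs( parsed.another[0].@attr )

assert woo == '&#xD1;'
assert attr == 'N-Tilde - &#209;'
```

The obvious weakness is the marker itself: if a value can legitimately contain '!!#x41;', the unmask step will corrupt it, so pick a marker that fits your data.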



Answer 2:

"The value of one of the attributes in my XML file is a string that equals a character hex code"

No it isn't. The representation of the attribute value in the original XML is a hexadecimal character reference, but the value of the attribute is the character Ñ. There are ways to configure some XML parsers to avoid expanding named entity references during parsing, but they must expand numeric character references as per the XML spec.
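A quick way to convince yourself of this is to parse the same character written three ways. The sketch below assumes the older default import of XmlParser (on Groovy 4, add `import groovy.xml.XmlParser`); the element and attribute names are made up for illustration. It also shows where the c391 in the question comes from: those are simply the UTF-8 bytes of the character.

```groovy
// Hex reference, decimal reference, and the literal character (written as \u00D1)
def doc = new XmlParser().parseText(
    '<root><a v="&#xD1;"/><b v="&#209;"/><c v="\u00D1"/></root>' )

def a = doc.a[0].@v
def b = doc.b[0].@v
def c = doc.c[0].@v

// All three spellings give the same one-character attribute value
assert a == b && b == c
assert a.length() == 1
assert (int) a.charAt(0) == 0xD1

// Stored as UTF-8, that single character occupies the two bytes C3 91
def utf8Hex = a.getBytes( 'UTF-8' ).encodeHex().toString()
assert utf8Hex == 'c391'
```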

You haven't said why storing the real character value is a problem. If it's to do with rendering the value to a browser then that can be handled by using .encodeAsHTML() at output time. If you need to save the value to another XML file then use an XML API to do so and it will handle the encoding issues for you, replacing characters with entities or character references where this is required to keep the result well-formed (in the case of Ñ it doesn't need to be escaped anyway unless you're writing XML in an unusual character set).
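As a rough illustration of letting an XML API handle the encoding, here is a sketch using the JAXP classes that ship with the JDK rather than anything Groovy-specific. The serialize closure is a hypothetical helper. With an encoding that cannot represent Ñ (US-ASCII here), the serializer is expected to fall back to a numeric character reference, though whether it writes it in hex or decimal is up to the implementation.

```groovy
import javax.xml.parsers.DocumentBuilderFactory
import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.dom.DOMSource
import javax.xml.transform.stream.StreamResult

// Build a tiny document whose attribute holds the real character U+00D1
def dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()
def rootElement = dom.createElement( 'root' )
rootElement.setAttribute( 'woo', '\u00D1' )
dom.appendChild( rootElement )

// Hypothetical helper: serialize the document in the given encoding
def serialize = { String encoding ->
    def transformer = TransformerFactory.newInstance().newTransformer()
    transformer.setOutputProperty( OutputKeys.ENCODING, encoding )
    def bytes = new ByteArrayOutputStream()
    transformer.transform( new DOMSource( dom ), new StreamResult( bytes ) )
    bytes.toByteArray()
}

def asciiBytes = serialize( 'US-ASCII' )
println new String( asciiBytes, 'US-ASCII' )

// Reading the ASCII output back yields the same character again
def reparsed = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse( new ByteArrayInputStream( asciiBytes ) )
assert reparsed.documentElement.getAttribute( 'woo' ) == '\u00D1'
```

Either way the attribute value survives the round trip, which is the property that matters if the goal is an equivalent document rather than a byte-identical one.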

In the specific case of Groovy's MarkupBuilder you can temporarily escape from XML mode and write hand-constructed markup directly to the output stream using mkp.yieldUnescaped, which would let you output a character reference somewhere the builder wouldn't normally bother.
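A minimal sketch of the difference, reusing the node/woo names from the first answer: passing the text as an ordinary attribute value gets the ampersand escaped, while mkp.yieldUnescaped writes the hand-built markup through unchanged. Note that with yieldUnescaped you are responsible for the whole element being well-formed.

```groovy
import groovy.xml.MarkupBuilder

def writer = new StringWriter()
def builder = new MarkupBuilder( writer )

builder.root {
    // Normal builder call: the literal ampersand in the value is escaped
    node( woo: '&#xD1;' )
    // Hand-written markup is passed through as-is, character reference included
    mkp.yieldUnescaped( '<node woo="&#xD1;"/>' )
}

def output = writer.toString()
println output

assert output.contains( '&amp;#xD1;' )
assert output.contains( '<node woo="&#xD1;"/>' )
```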