DOM parser in Arabic

Posted 2019-08-27 15:08

I have a problem parsing Arabic letters with the DOM parser: I get weird characters. I've tried switching to different encodings, but it didn't help.

The full code is at this link: http://test11.host56.com/parser.java

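public Document getDomElement(String xml) {
    Document doc = null;
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {
        // No charset is passed to InputStreamReader, so the platform default is used
        Reader reader = new InputStreamReader(new ByteArrayInputStream(
                xml.getBytes("UTF-8")));
        InputSource is = new InputSource(reader);

        DocumentBuilder db = dbf.newDocumentBuilder();

        //InputSource is = new InputSource();
        // This replaces the reader that was set in the constructor above
        is.setCharacterStream(new StringReader(xml));
        doc = db.parse(is);
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // Parsing failed: print the error and return null
        e.printStackTrace();
        return null;
    }
    return doc;
}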

My XML file:

<?xml version="1.0" encoding="UTF-8"?>
<music>
<song>
    <id>1</id>    
    <title>اهلا وسهلا</title>
    <artist>بكم</artist>
    <duration>4:47</duration>
    <thumb_url>http://wtever.png</thumb_url>
</song>
</music>

2 Answers
等我变得足够好
#2 · 2019-08-27 15:55

You already have the xml as String, so unless that string already contains the odd characters (that is, it has been read in with the wrong encoding), you can avoid encoding madness here by using a StringReader instead; e.g. instead of:

Reader reader = new InputStreamReader(new ByteArrayInputStream(
   xml.getBytes("UTF-8")));

use:

Reader reader = new StringReader(xml);
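
To illustrate the suggestion above, here is a minimal sketch of parsing an XML String through a StringReader. The class name StringReaderParse and the helper parse are made up for this example, and the Arabic title is written with Unicode escapes only so that the result does not depend on the encoding of the source file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original parser.java
public class StringReaderParse {

    // Parses XML that is already held in a Java String.
    // The parser receives characters, not bytes, so no decoding happens here.
    static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // The title from the question, as Unicode escapes
        String title = "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><title>" + title + "</title></song></music>";

        Document doc = parse(xml);
        String parsed = doc.getElementsByTagName("title").item(0).getTextContent();

        // Compare instead of printing the text, so console encoding does not matter
        System.out.println(parsed.equals(title));
    }
}
```

If the String is correct when it goes in, the text nodes come out unchanged. Garbled output from this path would suggest the String was already damaged before parsing.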

Edit: now that I see more of the code, it seems the encoding issue already happened before the XML is parsed, because that part contains:

HttpResponse httpResponse = httpClient.execute(httpPost);
HttpEntity httpEntity = httpResponse.getEntity();
xml = EntityUtils.toString(httpEntity);

The javadoc for the EntityUtils.toString says:

The content is converted using the character set from the entity (if any), failing that, "ISO-8859-1" is used.

It seems the server does not send the proper encoding information with the entity, so EntityUtils falls back to its default, ISO-8859-1, which is not UTF-8.

Fix: use the variant that takes an explicit default encoding:

xml = EntityUtils.toString(httpEntity, "utf-8");

Here I assume the server sends UTF-8. If the server uses a different encoding, set that one instead of UTF-8. (As the XML also declares encoding="UTF-8", I assume this is the case.) If the encoding the server uses is not known, you can only resort to guessing and are out of luck, sorry.
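
The effect described above can be reproduced without any HTTP code. The following is a small sketch using only the JDK, assuming the server really sends UTF-8 bytes; the class name CharsetMismatch and the helper decode are invented for illustration and stand in for whatever charset EntityUtils.toString ends up using.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HttpClient or the original code
public class CharsetMismatch {

    // Turns the raw response bytes into a String using the given charset
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // The artist value from the question, as Unicode escapes
        String original = "\u0628\u0643\u0645";

        // Assumption: these are the bytes the server puts on the wire
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // Fallback behaviour: every byte becomes one Latin-1 character
        String wrong = decode(wire, StandardCharsets.ISO_8859_1);

        // Explicit charset: the three Arabic letters are restored
        String right = decode(wire, StandardCharsets.UTF_8);

        System.out.println(wrong.length());         // 6, one char per UTF-8 byte
        System.out.println(right.length());         // 3
        System.out.println(right.equals(original)); // true
    }
}
```

The six "weird characters" in the wrong variant are what the DOM parser then receives, so no parser setting can repair them afterwards.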

欢心
#3 · 2019-08-27 16:00

If the XML contains Unicode characters such as Arabic or Persian letters, StringReader would throw an exception. In these cases, pass the InputStream directly to the DocumentBuilder's parse method.
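
Whether or not StringReader actually fails in a given setup, handing the raw bytes to the parser is a valid alternative, because the parser then picks the encoding from the XML declaration itself. A minimal sketch, assuming the bytes are available as an InputStream (a ByteArrayInputStream stands in for the HTTP response body here; StreamParse, parse and firstTitle are names made up for this example):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of the original parser.java
public class StreamParse {

    // Lets the XML parser decode the bytes itself, based on the
    // encoding declared in the XML prolog
    static Document parse(InputStream in) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    // Convenience wrapper: returns the text of the first <title> element
    static String firstTitle(byte[] xmlBytes) {
        Document doc = parse(new ByteArrayInputStream(xmlBytes));
        return doc.getElementsByTagName("title").item(0).getTextContent();
    }

    public static void main(String[] args) {
        String title = "\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627";
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><title>" + title + "</title></song></music>";

        // Stand-in for the bytes of the HTTP response body
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);

        System.out.println(firstTitle(body).equals(title));
    }
}
```

With this approach the response is never converted to a String first, so a wrong default charset in the HTTP layer cannot corrupt the text.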
