I have a problem parsing Arabic letters with DOM: I get weird characters. I've tried changing to a different encoding, but it didn't help.
The full code is at this link: http://test11.host56.com/parser.java
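public Document getDomElement(String xml) {
    Document doc = null;
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    try {
        // No charset is passed to InputStreamReader here, so the UTF-8 bytes
        // are decoded with the platform default charset.
        Reader reader = new InputStreamReader(new ByteArrayInputStream(
                xml.getBytes("UTF-8")));
        InputSource is = new InputSource(reader);
        DocumentBuilder db = dbf.newDocumentBuilder();
        //InputSource is = new InputSource();
        // This replaces the character stream set in the constructor above.
        is.setCharacterStream(new StringReader(xml));
        doc = db.parse(is);
        return doc;
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // Parsing failed: report the error and return null.
        e.printStackTrace();
        return null;
    }
}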
My XML file:
<?xml version="1.0" encoding="UTF-8"?>
<music>
<song>
<id>1</id>
<title>اهلا وسهلا</title>
<artist>بكم</artist>
<duration>4:47</duration>
<thumb_url>http://wtever.png</thumb_url>
</song>
</music>
You already have the XML as a String, so unless that string already contains the odd characters (that is, it has been read in with the wrong encoding), you can avoid encoding madness here by using a StringReader instead of converting the string to bytes and wrapping them in an InputStreamReader.
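A minimal sketch of the StringReader approach described above. The class name StringReaderParseExample and the helper parse are made up for illustration, and the Arabic text is written with Unicode escapes only so that the source file stays ASCII:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class StringReaderParseExample {

    // Parses XML that is already held in a String. The parser reads
    // characters, not bytes, so no byte encoding is involved at this step.
    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource is = new InputSource(new StringReader(xml));
        return db.parse(is);
    }

    public static void main(String[] args) throws Exception {
        // Same title and artist as in the question's XML, as Unicode escapes.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song>"
                + "<title>\u0627\u0647\u0644\u0627 \u0648\u0633\u0647\u0644\u0627</title>"
                + "<artist>\u0628\u0643\u0645</artist>"
                + "</song></music>";

        Document doc = parse(xml);
        String artist = doc.getElementsByTagName("artist").item(0).getTextContent();

        // The parsed text matches the original characters.
        System.out.println("\u0628\u0643\u0645".equals(artist));
    }
}
```

This only helps if the String itself is correct; if it was decoded wrongly earlier, the parser will faithfully return the wrong characters.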
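Edit: now that I see more of the code, it seems the encoding issue already happened before the XML is parsed, because that part reads the HTTP response into a String with EntityUtils.toString.
The javadoc for EntityUtils.toString says that the content is converted using the character set from the entity (if any) and that, failing that, ISO-8859-1 is used. It seems the server does not send the proper encoding information with the entity, and then EntityUtils uses that default, which is not UTF-8.
Fix: use the variant that takes an explicit default encoding as a second argument, and pass "UTF-8".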
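To see why the default matters, here is a small sketch using only the standard library. It is not the EntityUtils implementation: the decode helper is a hypothetical stand-in for "use the declared charset if there is one, otherwise fall back to a default", and the byte array stands in for the HTTP response body.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Hypothetical helper: decode with the declared charset if present,
    // otherwise with the given fallback.
    static String decode(byte[] body, Charset declared, Charset fallback) {
        Charset cs = (declared != null) ? declared : fallback;
        return new String(body, cs);
    }

    public static void main(String[] args) {
        String original = "\u0628\u0643\u0645"; // the artist name from the question's XML
        byte[] body = original.getBytes(StandardCharsets.UTF_8); // what a UTF-8 server sends

        // No charset declared by the server, fallback ISO-8859-1: garbage.
        String wrong = decode(body, null, StandardCharsets.ISO_8859_1);
        // No charset declared by the server, fallback UTF-8: correct text.
        String right = decode(body, null, StandardCharsets.UTF_8);

        System.out.println(wrong.equals(original)); // false
        System.out.println(right.equals(original)); // true
    }
}
```

With ISO-8859-1 as the fallback, every UTF-8 byte becomes its own character, which is the kind of "weird characters" described in the question.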
Here I assume the server sends UTF-8. If the server uses a different encoding, that one should be set instead of UTF-8. (However, as the XML also declares encoding="UTF-8", I assumed this is the case.) If the encoding the server uses is not known, then you can only resort to guessing.
If the XML contains non-ASCII characters such as Arabic or Persian letters, going through a String and a StringReader can go wrong, because the text has already been decoded by the time the parser sees it. In these cases, pass the InputStream directly to the DocumentBuilder's parse method, so that the parser reads the raw bytes and applies the encoding declared in the XML itself.
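A minimal sketch of parsing straight from an InputStream. The class name InputStreamParseExample and the helper parse are made up for illustration, and a ByteArrayInputStream stands in for the stream you would get from the HTTP response:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class InputStreamParseExample {

    // Hands the raw bytes to the parser, which picks the encoding from the
    // XML declaration instead of relying on an earlier String conversion.
    static Document parse(InputStream in) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(in);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<music><song><artist>\u0628\u0643\u0645</artist></song></music>";

        // Stand-in for the response stream: the document as UTF-8 bytes.
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));

        Document doc = parse(in);
        String artist = doc.getElementsByTagName("artist").item(0).getTextContent();
        System.out.println("\u0628\u0643\u0645".equals(artist));
    }
}
```

This relies on the encoding declared in the XML matching the bytes the server actually sends.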