I am developing a Java application where I am consuming a web service. The web service is created using a SAP server, which encodes the data automatically in Unicode. I get a Unicode string from the web service.
"
倥䙄ㄭ㌮쿣ී㈊〠漠橢圯湩湁楳湅潣楤杮湥潤橢″‰扯൪㰊഼┊敄瑶灹佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う䔯据摯湩′‰㸊ാ攊摮扯൪㐊〠漠橢㰼䰯湥瑧‵‰㸊ാ猊牴慥൭ 䘯〰‱⸱2
"
The above is the response.
I want to convert it to a readable text format, such as a String. I am using core Java.
If you have a byte[] or an InputStream (both binary data), you can get a String or a Reader (both text) with:
final String encoding = "UTF-8"; // or "UTF-16LE" or "UTF-16BE"
byte[] b = ...;
String s = new String(b, encoding); // decode bytes into text
InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) { // readLine() returns null at end of stream
    // process line
}
The reverse process uses:
byte[] b = s.getBytes(encoding); // encode text into bytes
OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);
writer.newLine(); // BufferedWriter has no println()
writer.flush();
Unicode is a numbering system for all characters. The UTF variants implement Unicode as bytes.
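To make that distinction concrete, here is a minimal sketch that prints the bytes a single character, é (code point U+00E9), becomes under three UTF variants. The class name EncodingDemo and the helper hex are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Hypothetical helper: encodes text with the given charset and
    // returns the resulting bytes as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // one Unicode code point
        System.out.println(hex(text, StandardCharsets.UTF_8));    // C3 A9
        System.out.println(hex(text, StandardCharsets.UTF_16LE)); // E9 00
        System.out.println(hex(text, StandardCharsets.UTF_16BE)); // 00 E9
    }
}
```

The same character number yields three different byte sequences, which is why decoding with the wrong variant produces garbage.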
Your problem:
Normally (with a web service) you would already have received a String. You could write that string to a file, for instance with the Writer above, either to check it yourself with a full Unicode font or to pass the file on for a check.
You probably need to check which UTF variant the text is in. For Asian scripts, UTF-16 (little endian or big endian) is usually the most compact choice. In XML, the encoding would already be declared.
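One way to check the variant, if you can get at the raw bytes, is to look for a byte order mark (BOM) at the start. This is only a rough sketch: the class BomSniffer and its detect method are hypothetical names, many streams carry no BOM at all, and FF FE can also be the start of a UTF-32LE BOM, so treat the result as a hint rather than proof.

```java
final class BomSniffer {
    // Returns the encoding name suggested by a leading BOM,
    // or null when the data does not start with a known BOM.
    static String detect(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null; // no BOM: the encoding must come from elsewhere
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(detect(sample)); // UTF-16LE
    }
}
```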
Addition:
FileWriter writes to a file using the default encoding (taken from the operating system on your machine). Instead use:
new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")
If it is a binary PDF, as @bobince said, use just a FileOutputStream on byte[] or InputStream.
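For the binary case, a minimal sketch might look like the following. The class name BinarySave and the method save are assumptions for illustration; the point is only that the bytes are copied as-is, with no Reader, Writer or encoding involved.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BinarySave {
    // Copies the stream to the target file byte for byte.
    // No charset is involved, so binary content such as a PDF stays intact.
    static void save(InputStream in, File target) throws IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```

If the data is already a byte[], wrapping it in a ByteArrayInputStream or calling out.write(bytes) directly works the same way.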
倥䙄ㄭ㌮쿣ී㈊〠漠橢圯湩湁楳湅潣楤杮湥潤橢″‰扯൪㰊഼┊敄瑶灹佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う䔯据摯湩′‰㸊ാ攊摮扯൪㐊〠漠橢㰼䰯湥瑧‵‰㸊ാ猊牴慥൭ 䘯〰‱⸱2
That's a PDF file that has been interpreted as UTF-16LE.
You need to look at what component is receiving the response and how it's dealing with the input to stop it being decoded as UTF-16LE, but ultimately there isn't a 'readable' version of it as such, as it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!
(Note: Unicode is a character set, UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)
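The diagnosis can be reproduced in a few lines. The sketch below takes the ASCII bytes of a PDF header, "%PDF-1.3", decodes them as UTF-16LE, and gets the same four characters the question's string starts with. The class PdfMojibake and its methods are hypothetical names. Re-encoding the garbled string as UTF-16LE can give the original bytes back, but only if nothing was lost on the way (an odd byte count or invalid surrogate pairs would already have been replaced), so fixing the receiving component remains the real solution.

```java
import java.nio.charset.StandardCharsets;

final class PdfMojibake {
    // Reproduces the bug: raw file bytes wrongly decoded as UTF-16LE text.
    static String misdecode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    // Best-effort reversal: turns the garbled text back into bytes.
    static byte[] recover(String garbled) {
        return garbled.getBytes(StandardCharsets.UTF_16LE);
    }

    // Checks for the "%PDF-" signature at the start of the data.
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.3".getBytes(StandardCharsets.US_ASCII);
        String garbled = misdecode(header);
        // '%' is 0x25 and 'P' is 0x50; read little endian they form U+5025.
        System.out.println(garbled.equals("\u5025\u4644\u312D\u332E")); // true
        System.out.println(looksLikePdf(recover(garbled)));             // true
    }
}
```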
This is definitely not a valid string. This looks like mangled UTF-16.
UPDATE
Indeed, @bobince is right: this is a PDF file (most probably in UTF-8 or plain ASCII) displayed as UTF-16. When displayed as UTF-8, the string does show PDF source code. Good catch.