Java: decoding a Unicode string to readable text

Posted 2019-09-14 00:10

Question:

I am developing a Java application that consumes a web service. The web service is created using an SAP server, which automatically encodes the data in Unicode. I get a Unicode string from the web service.

" 倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2 "

The above is the response.

I want to convert it to a readable text format, such as a String. I am using core Java.

Answer 1:

If you have byte[] or an InputStream (both binary data) you can get a String or a Reader (both text) with:

final String encoding = "UTF-8"; // or "UTF-16LE" or "UTF-16BE"

byte[] b = ...;
String s = new String(b, encoding);

InputStream is = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) { // readLine() returns null at end of stream
    // process line
}

The reverse process uses:

byte[] b = s.getBytes(encoding);

OutputStream os = ...;
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, encoding));
writer.write(s);   // BufferedWriter has no println(); use write() and newLine()
writer.newLine();
writer.flush();

Unicode is a numbering system for all characters: it assigns each character a number (code point). The UTF variants (UTF-8, UTF-16LE, UTF-16BE) represent those numbers as bytes.
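To make the difference between the character number and its byte form concrete, here is a small sketch that encodes one character in three UTF variants and prints the bytes as hex. The class name EncodingBytes and the helper hex are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: EncodingBytes and hex() are hypothetical names.
class EncodingBytes {
    // Formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4E2D"; // one character, code point U+4E2D
        System.out.println("UTF-8:    " + hex(text.getBytes(StandardCharsets.UTF_8)));    // E4 B8 AD
        System.out.println("UTF-16LE: " + hex(text.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        System.out.println("UTF-16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
    }
}
```

The same code point yields three different byte sequences, which is why the decoder has to be told which encoding the bytes are in.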


Your problem:

Normally, with a web service, you would already have received a String. You could write that string to a file using the Writer above, for instance, either to check it yourself with a full Unicode font or to pass the file on for a check.

You may need to check which UTF variant the text is in. For Asian scripts, UTF-16 (little endian or big endian) is usually the most compact. In XML, the encoding would already be declared.
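One way to check the UTF variant, when you have the raw bytes, is to look for a byte order mark (BOM) at the start. The sketch below is only a heuristic: many streams carry no BOM at all, it ignores UTF-32, and the names BomSniffer and detectByBom are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only: BomSniffer and detectByBom() are hypothetical names.
class BomSniffer {
    // Returns the charset indicated by a leading BOM, or null if no known BOM is present.
    static Charset detectByBom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // note: UTF-32LE starts with the same two bytes
        }
        return null; // no BOM: the encoding must come from elsewhere (headers, XML declaration)
    }
}
```

A null result does not mean the data is not text; it only means the encoding has to be taken from the transport or the document itself.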


Addition:

FileWriter writes to a file using the default encoding (taken from the operating system on your machine). To control the encoding, use this instead:

new OutputStreamWriter(new FileOutputStream(new File("...")), "UTF-8")

If it is a binary PDF, as @bobince said, just use a FileOutputStream with the byte[] or InputStream, without any Reader or Writer in between.
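A minimal sketch of that binary path, assuming you can get at the response as an InputStream: the bytes are copied to a file unchanged, with no charset involved. The class BinarySaver, the method copyRaw, and the file name response.pdf are placeholders for this illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Illustrative only: BinarySaver and copyRaw() are hypothetical names.
class BinarySaver {
    // Copies the stream to the target file byte for byte and returns the number of bytes written.
    static long copyRaw(InputStream in, File target) throws IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // write exactly the bytes that were read
                total += n;
            }
            return total;
        }
    }
}
```

Usage would look like BinarySaver.copyRaw(responseStream, new File("response.pdf")), where responseStream stands for whatever stream your web service client exposes.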



Answer 2:

倥䙄ㄭ㌮਍쿣ී㈊〠漠橢਍圯湩湁楳湅潣楤杮਍湥潤橢਍″‰扯൪㰊഼┊敄瑶灹⁥佐呓′†䘠湯⁴佃剕䕉⁒渠牯慭慌杮䔠ൎ⼊祔数⼠潆瑮਍匯扵祴数⼠祔数റ⼊慂敳潆瑮⼠潃牵敩൲⼊慎敭⼠う㄰਍䔯据摯湩⁧′‰൒㸊ാ攊摮扯൪㐊〠漠橢਍㰼਍䰯湥瑧⁨‵‰൒㸊ാ猊牴慥൭ 䘯〰‱⸱2

That's a PDF file that has been interpreted as UTF-16LE.

You need to look at what component is receiving the response and how it's dealing with the input to stop it being decoded as UTF-16LE, but ultimately there isn't a 'readable' version of it as such, as it's a binary file. Extracting the document text out of a PDF file is a much bigger problem!
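The diagnosis can be checked with a few lines of Java: encoding the wrongly decoded characters back to UTF-16LE gives the original bytes. The sketch below uses Unicode escapes for what appear to be the first four characters of the response; the names MisdecodedPdf and undoUtf16Le are made up for this example. This round trip is not safe for whole files in general, because binary data can contain byte pairs that are not valid UTF-16 and are lost on decoding, so the real fix is still to avoid the decoding step.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: MisdecodedPdf and undoUtf16Le() are hypothetical names.
class MisdecodedPdf {
    // Encodes a string that was wrongly decoded as UTF-16LE back into bytes.
    static byte[] undoUtf16Le(String misdecoded) {
        return misdecoded.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // Assumed to be the first four characters of the response, written as escapes.
        String start = "\u5025\u4644\u312D\u332E";
        byte[] raw = undoUtf16Le(start);
        // Each 16-bit unit splits into two ASCII bytes, low byte first.
        System.out.println(new String(raw, StandardCharsets.US_ASCII)); // prints %PDF-1.3
    }
}
```

The recovered text is the PDF header, which is the signature every PDF file starts with.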

(Note: Unicode is a character set, UTF-16LE is an encoding of that set into bytes. Microsoft call the UTF-16LE encoding "Unicode" due to a historical accident, but that's misleading.)



Answer 3:

This is definitely not a valid string. This looks like mangled UTF-16.

UPDATE

Indeed, @Bobince is right: this is a PDF file (most probably in UTF-8 or plain ASCII) displayed as UTF-16. When displayed as UTF-8, this string does show PDF source code. Good catch.