Java; Trying to convert a String which contains IS

Posted 2020-07-22 10:20

I don't know if this is going to make sense but this is what I make of it.

I'm working in Eclipse with UTF-8 encoding for all my files. In one of them I need to convert a String from ISO-8859-1 to UTF-8. However, that String is built within the file itself (it doesn't come from input), which is why I believe it starts out as UTF-8 and the conversion doesn't go the way I expected.

The String's original content is:

||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JUÁREZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|Persona Física con Actividad Empresarial|BAÑ930616R66|BAÑOMOBIL, S.A. DE C.V.|Av. 1° de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|MÉXICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 año www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 año www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||

Its original encoding should be ISO-8859-1, and when I convert it to UTF-8 it should produce:

||3.2|2013-01-25T17:05:06|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JUÃREZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÃREZ|ESTADO DE MEXICO|MÉXICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÃREZ|ESTADO DE MEXICO|MÉXICO|53100|Persona Física con Actividad Empresarial|BAÑ930616R66|BAÑOMOBIL, S.A. DE C.V.|Av. 1° de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|MÉXICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 año www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 año www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||

That is what I need, and I'm not achieving it.

This is what I have tried so far:

    String input = null;
    input = "||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JUÁREZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JUÁREZ|ESTADO DE MEXICO|MÉXICO|53100|Persona Física con Actividad Empresarial|BAÑ930616R66|BAÑOMOBIL, S.A. DE C.V.|Av. 1° de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|MÉXICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 año www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 año www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||";
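    String intento1 = null, intento2 = null, intento3 = null;
    try {
        // Attempt 1: encode as ISO-8859-1, then decode those bytes as UTF-8
        intento1 = new String(input.getBytes("ISO-8859-1"), "UTF-8");
        // Attempt 2: encode attempt 1 with the platform default charset, decode as UTF-8
        intento2 = new String(intento1.getBytes(), "UTF-8");
        // Attempt 3: encode the input with the platform default charset, decode as UTF-8
        intento3 = new String(input.getBytes(), "UTF-8");
    } catch (UnsupportedEncodingException e) { // requires import java.io.UnsupportedEncodingException
        e.printStackTrace();
    }
    System.out.println(intento1);
    System.out.println(intento2);
    System.out.println(intento3);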

Which prints:

||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JU?REZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|Persona F?sica con Actividad Empresarial|BA?930616R66|BA?OMOBIL, S.A. DE C.V.|Av. 1? de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|M?XICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 a?o www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 a?o www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||
||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JU?REZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|Persona F?sica con Actividad Empresarial|BA?930616R66|BA?OMOBIL, S.A. DE C.V.|Av. 1? de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|M?XICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 a?o www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 a?o www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||
||3.2|2013-01-25T17:24:00|ingreso|PAGO EN UNA SOLA EXHIBICION|6386.21|MXN|7408.00|No identificado|NAUCALPAN DE JU?REZ, ESTADO DE MEXICO|CAOS640116HT5|OSCAR MARTIN CARRERA|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|CTO. ORADORES 33|33|CD. SATELITE|NAUCALPAN DE JU?REZ|ESTADO DE MEXICO|M?XICO|53100|Persona F?sica con Actividad Empresarial|BA?930616R66|BA?OMOBIL, S.A. DE C.V.|Av. 1? de Mayo|197|San. Lorenzo|TLALNEPANTLA DE BAZ|ESTADO DE MEXICO|M?XICO|54047|1|NO APLICA|Dominio .com|Dominio por 1 a?o www.sanitariosportatiles.com|586.21|586.21|1|NO APLICA|Hospedaje 2 Gb|Hospedaje 2 Gb por 1 a?o www.sanitariosportatiles.com|5800.00|5800.00|IVA|16.00|1021.79|1021.79||

That is nowhere near what I want.

EDIT 1: When I get the String from an input, one of the conversions works fine, but I need it to work when the String is declared inside the file.

EDIT 2: This is basically what I need http://cryptosys.net/cgi-bin/manual.cgi?m=pki&name=CNV_UTF8FromLatin1 but in Java.

4 Answers
孤傲高冷的网名
#2 · 2020-07-22 10:45

I finally got it to show the way I specified in the question; I was just using the wrong charset.

    intento2 = new String(input.getBytes(Charset.forName("UTF-8")), Charset.forName("Windows-1252"));

This displayed it the way I needed it.
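
A minimal sketch of why that line produces the garbled form shown in the question: each non-ASCII character becomes two UTF-8 bytes, and each of those bytes is then decoded as a separate Windows-1252 character. The class and method names here are made up for illustration, and the literals use Unicode escapes so the result doesn't depend on how the source file is encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then deliberately decode those bytes as Windows-1252.
    static String utf8BytesReadAsCp1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u00C9 is É; its UTF-8 bytes are C3 89, which Windows-1252 reads as Ã and ‰
        System.out.println(utf8BytesReadAsCp1252("M\u00C9XICO"));
        // \u00F1 is ñ; its UTF-8 bytes are C3 B1, which Windows-1252 reads as Ã and ±
        System.out.println(utf8BytesReadAsCp1252("a\u00F1o"));
    }
}
```

Note that this does not convert anything to UTF-8; it reproduces what UTF-8 bytes look like when they are misread as Windows-1252.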

在下西门庆
#3 · 2020-07-22 10:46

When loading any data from binary representation, you must know what encoding is used for that representation in order to interpret or decode it. If you assume the wrong encoding, then you will probably get garbage -- something that does not make sense.

In order to construct a String from binary data, you have to specify the encoding of the source data. Otherwise you may get garbage -- the constructed String may not contain the characters represented in the source data.

More specifically for your case, if you try to load UTF-8 data using the ISO-8859-1 encoding, you may get garbage. I say "may" because these two encodings overlap: the first 128 code points (0 through 127, the ASCII range) are encoded identically in both. If only those code points are used, the decoding will appear to work, but since that is not guaranteed it should not be relied on.
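
To illustrate that overlap, here is a small sketch (the class and method names are hypothetical): ASCII-only text decodes identically under both charsets, while text with a single accented character does not.

```java
import java.nio.charset.StandardCharsets;

class DecodeOverlapDemo {
    // Decode the same UTF-8 bytes twice: once correctly, once as ISO-8859-1.
    static boolean sameUnderBothCharsets(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String asUtf8 = new String(utf8, StandardCharsets.UTF_8);
        String asLatin1 = new String(utf8, StandardCharsets.ISO_8859_1);
        return asUtf8.equals(asLatin1);
    }

    public static void main(String[] args) {
        System.out.println(sameUnderBothCharsets("ESTADO DE MEXICO")); // ASCII only
        System.out.println(sameUnderBothCharsets("M\u00C9XICO"));      // contains É
    }
}
```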

If you are telling Eclipse to decode your source files using UTF-8, then you should only edit those source files using an editor capable of and configured for editing using UTF-8 encoding.

One more point: the internal representation of String data in Java is UTF-16. Therefore, it is incorrect to say that you have Strings which "contain ISO-8859-1 encoding". If you have a String, you always have UTF-16 data. Whether that data is correct depends on how you constructed the String, as discussed above.
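
A quick way to see this, sketched with a hypothetical helper: the same one-character String yields byte sequences of different lengths depending on the charset you encode it with, while the String itself remains a sequence of UTF-16 code units.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringIsUtf16Demo {
    // Number of bytes needed to encode the text in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u00C1"; // Á, written as an escape to sidestep source-file encoding
        System.out.println(s.length());                                    // UTF-16 code units in the String
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // bytes when encoded as ISO-8859-1
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // bytes when encoded as UTF-8
    }
}
```

An encoding only comes into play at the boundary where a String is turned into bytes or built from them.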

在下西门庆
#4 · 2020-07-22 10:49

In simple words, if you want to convert charset=ISO-8859-1 data to a Java String (note that Java Strings are UTF-16 internally, not UTF-8; this line re-reads the String's ISO-8859-1 bytes as UTF-8):

    String response = new String(input.getBytes("ISO-8859-1"), "UTF-8");
神经病院院长
#5 · 2020-07-22 11:06

I think the fundamental problem here is your expectations.

If I understand you correctly, you expect to be able to change Á to à by changing character encodings. That cannot happen. Those are different characters; i.e. different code points - Á is Unicode codepoint 00C1 (or C1 in ISO-8859-1) and à is 00C3 / C3.

So when you transcode an Á in ISO-8859-1 to Unicode to UTF-8 you should get exactly the same character Á. If you don't, then the translation is broken.
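
As a sketch of what a correct transcoding looks like (the class and method names are illustrative, not from any library): decode the bytes with the source charset, then encode the resulting String with the target charset. The bytes change, the character does not.

```java
import java.nio.charset.StandardCharsets;

class TranscodeDemo {
    // Transcode: decode with the source charset, then encode with the target charset.
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xC1 };     // Á in ISO-8859-1
        byte[] utf8 = latin1ToUtf8(latin1);  // the bytes become C3 81 in UTF-8...
        String roundTrip = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals("\u00C1")); // ...but the character is still Á
    }
}
```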

You also expect MÉXICO to translate to MÉXICO ... which seems totally bizarre to me. Perhaps there's a problem in your transcription of the characters into the Question ...

Now if the lexicography rules for your language / region say that Á and Ã are actually equivalent, then it would be reasonable to "normalize" to a preferred form. However, it is not the role of the character encoding / decoding to do such locale-related translations. You need to code it yourself ... or find some other library that does it.


Messing around at the byte level (encoding with one charset and decoding with a different one) is not going to "fix" this. If anything it is going to make things worse. Your messing around is generating byte sequences that can't be mapped to the target encoding scheme ... and hence the question marks.
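
A small sketch of where the question marks come from, assuming the first attempt in the question (the class and method names are hypothetical). The ISO-8859-1 byte for Á is not a valid UTF-8 sequence on its own, so the decoder substitutes U+FFFD, and a console that cannot represent U+FFFD typically prints `?`.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Encode as ISO-8859-1, then decode as UTF-8 (the first attempt in the question).
    static String latin1BytesReadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String broken = latin1BytesReadAsUtf8("JU\u00C1REZ");
        // The lone byte C1 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(broken.indexOf('\uFFFD'));
        // Encoding U+FFFD to a charset that lacks it yields '?'.
        byte[] reEncoded = broken.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(reEncoded, StandardCharsets.ISO_8859_1));
    }
}
```

Once the replacement has happened, the original character is gone; no later conversion can recover it.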
