Java JAXB UTF-8/ISO conversions

Posted 2019-06-04 02:11

Question:

I have an XML file that contains non-standard characters (like a weird "quote").

I read the XML using UTF-8 / ISO-8859-1 / ASCII and then unmarshalled it:

BufferedReader br = new BufferedReader(new InputStreamReader(
                (conn.getInputStream()),"ISO-8859-1"));
        String output;
        StringBuffer sb = new StringBuffer();
        while ((output = br.readLine()) != null) {
            //fetch XML
            sb.append(output);
        }


        try {

            jc = JAXBContext.newInstance(ServiceResponse.class);

            Unmarshaller unmarshaller = jc.createUnmarshaller();

            ServiceResponse OWrsp =  (ServiceResponse) unmarshaller
                    .unmarshal(new InputSource(new StringReader(sb.toString())));
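One thing worth checking at this step is whether the charset passed to InputStreamReader matches what the server actually sends. The following is a minimal standalone sketch (the class name CharsetMismatchDemo is hypothetical, and the idea that the service sends UTF-8 is an assumption, not something the question confirms). It shows how an en dash that arrives as UTF-8 bytes is mangled when it is decoded as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // Decodes the bytes with ISO-8859-1, which maps every byte to one char.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes with UTF-8, the charset they were written in.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "10\u201311"; // "10", EN DASH (U+2013), "11"
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);

        // The en dash occupies three bytes in UTF-8, so the wrong decoder
        // yields three unrelated chars in its place.
        System.out.println(decodeAsLatin1(wire).length());
        System.out.println(decodeAsUtf8(wire).equals(original));
    }
}
```

If this turns out to be the problem, passing the raw InputStream to Unmarshaller.unmarshal instead of a pre-decoded String lets the XML parser pick the encoding from the XML declaration.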

I have an Oracle function that takes these escaped codes (numeric character references) and maps them to the "literal" symbols, e.g. "&#x2019;" => right single quotation mark (U+2019).

With ISO-8859-1, JAXB handles these characters fine, i.e. all the weird single quotes are encoded as "&#x2019;".

So suppose my string is: class of 10–11‐year‐olds (note the weird hyphen between 11 and year)
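To make the mapping concrete, here is a small sketch of what such a conversion does on the Java side. It is not the Oracle function from the question; CharRefDemo and parseHexCharRef are hypothetical names, and only the hexadecimal "&#x...;" form is handled:

```java
public class CharRefDemo {
    // Parses a hexadecimal reference of the form "&#xHHHH;" into its code point.
    static int parseHexCharRef(String ref) {
        if (!ref.startsWith("&#x") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a hex character reference: " + ref);
        }
        return Integer.parseInt(ref.substring(3, ref.length() - 1), 16);
    }

    // Turns the reference into the literal character(s) it stands for.
    static String decodeHexCharRef(String ref) {
        return new String(Character.toChars(parseHexCharRef(ref)));
    }

    public static void main(String[] args) {
        int codePoint = parseHexCharRef("&#x2019;");
        System.out.println(Integer.toHexString(codePoint));
        System.out.println(Character.getName(codePoint));
        System.out.println(decodeHexCharRef("&#x2019;"));
    }
}
```

Note that the number in the reference is a Unicode code point, not an ISO-8859-1 byte value; U+2019 has no representation in ISO-8859-1 at all, which is why it has to be escaped.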

jc = JAXBContext.newInstance(ScienceProductBuilderInfoType.class);
        Marshaller m = jc.createMarshaller();
        m.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
        //save a temp file
        File file2 = new File("tmp.xml");

This saves the following to the file:

class of 10–11‐year‐olds (which is what I want, so saving to a file works!)

[Side note: I have read the file back using a Java FileReader, and it outputs the above string fine.]

The issue is that the String produced by the JAXB unmarshaller has weird output; for some reason I cannot get the string to represent –.

When I check 1: the unmarshalled XML output:

class of 10?11?year?olds

2: the file output:

class of 10–11‐year‐olds
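A "?" in place of each dash is what Java produces when a String is encoded to ISO-8859-1 and a character has no mapping in that charset. Whether that is what happens here (for example when the string is printed to a Latin-1 console) is an assumption; the sketch below, with the hypothetical class name Latin1ReplacementDemo, only reproduces the symptom:

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1ReplacementDemo {
    // Encodes to ISO-8859-1 bytes and decodes them again. String.getBytes
    // substitutes the charset's replacement byte ('?') for unmappable chars.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reports whether every character of the text exists in ISO-8859-1.
    static boolean fitsInLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // EN DASH (U+2013) and HYPHEN (U+2010) are outside ISO-8859-1.
        String text = "class of 10\u201311\u2010year\u2010olds";
        System.out.println(fitsInLatin1(text));
        System.out.println(roundTripLatin1(text));
    }
}
```

The in-memory String itself may still hold the correct characters; the loss happens only at the point where it is converted to ISO-8859-1 bytes.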

I even tried reading the saved XML file back and then unmarshalling that (in hopes of getting the – in my string):

String sCurrentLine;
        BufferedReader br = new BufferedReader(new FileReader("tmp.xml"));
        StringBuffer sb = new StringBuffer();
        while ((sCurrentLine = br.readLine()) != null) {
            sb.append(sCurrentLine);
        }




        ScienceProductBuilderInfoType rsp =  (ScienceProductBuilderInfoType) unm
                .unmarshal(new InputSource(new StringReader(sb.toString())));

To no avail.

Any ideas how to get the ISO-8859-1 encoded character in JAXB?

Answer 1:

Solved, using this tidbit of code found on Stack Overflow:

final class HtmlEncoder {
  private HtmlEncoder() {}

  public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
      T out) throws java.io.IOException {
    for (int i = 0; i < sequence.length(); i++) {
      char ch = sequence.charAt(i);
      if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
        out.append(ch);
      } else {
        int codepoint = Character.codePointAt(sequence, i);
        // handle supplementary range chars
        i += Character.charCount(codepoint) - 1;
        // emit entity
        out.append("&#x");
        out.append(Integer.toHexString(codepoint));
        out.append(";");
      }
    }
    return out;
  }
}

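// escapeNonLatin takes the output Appendable as its second argument and
// declares IOException, so call it where that exception is handled.
String escaped = HtmlEncoder.escapeNonLatin(MYSTRING, new StringBuilder()).toString();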
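For a quick check of what this kind of escaping produces on the string from the question, here is a compact standalone variant. It is a sketch, not the answer's code: EscapeDemo and escapeNonAscii are hypothetical names, and it returns a String instead of writing to an Appendable.

```java
public class EscapeDemo {
    // Replaces every code point above U+007F with a hexadecimal numeric
    // character reference and copies ASCII through unchanged.
    static String escapeNonAscii(CharSequence text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int codePoint = Character.codePointAt(text, i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            }
            // Advance by two chars for supplementary code points.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("class of 10\u201311\u2010year\u2010olds"));
    }
}
```

Like the BASIC_LATIN check in the code above, this escapes everything outside ASCII, so characters that ISO-8859-1 can represent directly (such as é) are escaped as well.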