Map supplementary Unicode characters to BMP (if po

Posted 2019-05-26 22:14

Question:

I ran into the issue that my XML parser (VTD-XML) doesn't seem to be able to handle Unicode supplementary characters (please correct me if I'm already wrong here). It seems the parser only uses the lower 16 bits of such characters.

I cannot switch to another parser in the project I'm working on. I am parsing Medline abstracts (https://www.ncbi.nlm.nih.gov/pubmed), and it seems that documents containing supplementary characters have been added over the last year (e.g. https://www.ncbi.nlm.nih.gov/pubmed/?term=26855708, end of the results section).

As a quick and dirty fix I could just delete all characters above 0xFFFF from the documents. Obviously, that would destroy some expressions in the document texts, so I'm not really happy with that solution.
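For reference, the quick and dirty variant could look roughly like the sketch below. It assumes the document has already been decoded into a Java String (Java 8+) before it is handed to the parser; the class and method names are made up for illustration.

```java
final class SupplementaryStripper {
    /** Returns the text with every code point above U+FFFF removed. */
    static String stripSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()                         // iterate by code point, not by char
            .filter(Character::isBmpCodePoint)    // keep only U+0000..U+FFFF
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10400 sits between 'a' and 'b' and is dropped
        System.out.println(stripSupplementary("a\uD801\uDC00b"));
    }
}
```

Iterating over code points rather than chars avoids leaving half of a surrogate pair behind.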

Since I can't change the parser, I was wondering whether there is some way to map supplementary characters to characters within the BMP that are likely to have a similar-looking glyph, where such a character exists.
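One candidate for such a mapping might be Unicode compatibility normalization (NFKC), which Java exposes through java.text.Normalizer. This is only a sketch of the idea, not a true "looks similar" mapping: NFKC helps for characters that have a compatibility decomposition (for example the Mathematical Alphanumeric Symbols), while others, such as U+10400, have none and fall through to a caller-supplied fallback. The names `BmpFolding`, `foldToBmp` and `fallback` are hypothetical.

```java
import java.text.Normalizer;

final class BmpFolding {
    /**
     * Replaces each supplementary code point by its NFKC form if that form
     * lies entirely within the BMP, otherwise by the given fallback string.
     * BMP characters are copied unchanged.
     */
    static String foldToBmp(String text, String fallback) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isBmpCodePoint(cp)) {
                sb.appendCodePoint(cp);
                return;
            }
            String single = new String(Character.toChars(cp));
            String folded = Normalizer.normalize(single, Normalizer.Form.NFKC);
            boolean allBmp = folded.codePoints().allMatch(Character::isBmpCodePoint);
            sb.append(allBmp ? folded : fallback);
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D400 MATHEMATICAL BOLD CAPITAL A has the compatibility mapping "A"
        System.out.println(foldToBmp("x\uD835\uDC00y", "?"));
        // U+10400 has no decomposition, so the fallback is used
        System.out.println(foldToBmp("\uD801\uDC00", "?"));
    }
}
```

Whether the folded text is acceptable depends on the downstream use, since the mapping loses the distinction between, say, a bold mathematical A and a plain A.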

Of course I welcome any other ideas. It would even be fine to replace the supplementary characters with some kind of placeholder and put the original characters back in afterwards, but this seems error-prone. Better ideas?
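To make the placeholder idea concrete, here is a minimal round-trip sketch. The token format `{{U+XXXXX}}` is an arbitrary choice for illustration, and the sketch assumes that this exact pattern never occurs literally in the documents; if it does, decoding would turn it into a supplementary character, which is one of the ways this approach is error-prone. Offsets reported by the parser would also refer to the encoded text, not the original.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SupplementaryPlaceholders {
    // hypothetical token format: {{U+10400}}, 5 or 6 upper-case hex digits
    private static final Pattern TOKEN = Pattern.compile("\\{\\{U\\+([0-9A-F]{5,6})\\}\\}");

    /** Replaces every supplementary code point with a BMP-only token. */
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("{{U+")
                  .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                  .append("}}");
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    /** Turns tokens back into the characters they stand for. */
    static String decode(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1), 16);
            // leave tokens that do not name a valid supplementary code point untouched
            String replacement = Character.isSupplementaryCodePoint(cp)
                    ? new String(Character.toChars(cp))
                    : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "a\uD801\uDC00b";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

The encoded form would be fed to the parser, and `decode` applied to every string taken back out of it.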

Edit: Here is a (hopefully) minimal example of how this issue comes up with VTD-XML:

@Test
public void parseUnicodeBeyondBMP() throws NavException, FileNotFoundException, IOException, EncodingException, EOFException, EntityException, ParseException {
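    // character code point 0x10400, written as a UTF-16 surrogate pair
    String unicode = "<supplementary>\uD801\uDC00</supplementary>";
    // encode explicitly as UTF-8 instead of relying on the platform default charset
    byte[] unicodeBytes = unicode.getBytes("UTF-8");
    assertEquals(unicode, new String(unicodeBytes, "UTF-8"));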

    VTDGen vg = new VTDGen();
    vg.setDoc(unicodeBytes);
    vg.parse(false);
    VTDNav vn = vg.getNav();
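    // the fragment carries the offset in its lower 32 bits and the length in its upper 32 bits
    long fragment = vn.getContentFragment();
    int offset = (int) fragment;
    int length = (int) (fragment >> 32);
    // decode with the same charset that was used for encoding
    String originalBytePortion = new String(Arrays.copyOfRange(unicodeBytes, offset, offset+length), "UTF-8");
    String vtdString = vn.toRawString(offset, length);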
    // this actually succeeds
    assertEquals("\uD801\uDC00", originalBytePortion);
    // this fails ;-( the returned character is Ѐ, code point 0x400, i.e. only the lower 16 bits of 0x10400 survive
    assertEquals("\uD801\uDC00", vtdString);
}
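
The observed result (0x400 instead of 0x10400) is consistent with the code point being truncated to 16 bits, although that is only my guess about what happens inside the parser. The arithmetic can be checked with plain Java; the class and method names here are made up for illustration.

```java
final class CodePointTruncation {
    /** Keeps only the low 16 bits of a code point, mimicking the suspected behaviour. */
    static int lower16Bits(int codePoint) {
        return codePoint & 0xFFFF;
    }

    public static void main(String[] args) {
        String s = "\uD801\uDC00";   // U+10400 as a UTF-16 surrogate pair
        int cp = s.codePointAt(0);   // combines both surrogates into one code point
        System.out.printf("code point: U+%X%n", cp);                 // U+10400
        System.out.printf("is supplementary: %b%n",
                Character.isSupplementaryCodePoint(cp));             // true
        System.out.printf("low 16 bits: U+%04X%n", lower16Bits(cp)); // U+0400
    }
}
```

Note that U+0400 is neither of the two surrogates (0xD801, 0xDC00), so the parser does seem to decode the full code point first and lose the upper bits afterwards.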