Java: skip binary data in xml file while parsing

Posted 2020-05-07 04:36

Question:

I want to parse an XML file in Java that contains binary data. Here is an example of the file:

<?xml version="1.0" encoding="utf-8"?>
<documents>
  <document>
    <element name="docid">
      <value><![CDATA[0902307e8004c74c]]></value>
    </element>
    <element name="published">
      <value><![CDATA[2012-01-01T00:00:00]]></value>
    </element>
    <element name="documenttype">
      <value><![CDATA[Circular]]></value>
    </element>
    <element name="data">
      <value><![CDATA[%PDF-1.6
%����
1020 0 obj
<</Filter/FlateDecode/First 20/Length 270/N 3/Type/ObjStm>>stream
�o^���)|�,�Ypoef�
l���o�>����u���b"Cb�|���%&��D�yD��q�q�q�q�q��%_ja�LJob��/��3"=����o���]V11}�    }a�+'6@����C�,^}�d%�۠�`s��q��5�׷^(�N��{S<S�����A��������-������f\ڌ��|U/݌�z���f�I9����g�g���s���0z'��X~
endstream
endobj
startxref
55097
%%EOF
]]></value>
    </element>
    <element name="dataname">
      <value><![CDATA[sdfsfsfsdsdfsd.pdf]]></value>
    </element>
  </document>
</documents>

Normally I would parse such an XML file this way:

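import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Inside a void method; fastXMLFile is declared elsewhere (for example a java.io.File).
Document doc = null;
try {
    DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
    // Throws SAXException as soon as the parser meets a character XML does not allow.
    doc = documentBuilder.parse(fastXMLFile);
} catch (ParserConfigurationException e) {
    e.printStackTrace();
} catch (SAXException e) {
    System.out.println("SAXExept");
    e.printStackTrace();
} catch (IOException e) {
    System.out.println("Test");
    return;
}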

But because of the "data" element, which contains binary data, the parser fails with:

[Fatal Error] xmlfile.xml:58:10: An invalid XML character (Unicode: 0x1a) was found in the CDATA section.
SAXExept
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the CDATA section.

I don't need to parse this data field for now; I could just skip it. I only want to parse the rest of the data. Is this possible?

Answer 1:

Since your XML includes invalid characters (as the exception shows), you can't expect a conforming parser to read it. Because you can't change how the XML file is created and can't see the code of the search engine, I believe the easiest option is to remove the invalid characters from the XML before parsing it.

So the process would be:

1- Read the contents of the XML file into a String.

2- Go through the String and remove all invalid characters.

3- Write the String back to the file, or create a new file if you can't modify the original.

4- Parse the modified/new file.

To replace the invalid characters, see the following article, which also includes a method for doing so:

Invalid XML Characters: when valid UTF8 does not mean valid XML.
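
The four steps above can be sketched in Java roughly as follows. This is only a minimal sketch and not the method from the linked article: the class name XmlSanitizer and the methods stripInvalidXmlChars and parseSanitized are hypothetical names chosen for illustration. It assumes the file fits in memory, decodes it as UTF-8 (malformed byte sequences from the binary section become U+FFFD, which XML allows), and parses the cleaned String directly instead of writing it back to disk first. The allowed ranges are those of the Char production in XML 1.0.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class XmlSanitizer {

    /** Drops every code point that the XML 1.0 Char production does not allow. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Steps 1, 2 and 4: read the file, clean it, parse the cleaned text. */
    static Document parseSanitized(Path xmlFile) throws Exception {
        // Assumption: the non-binary parts of the file really are UTF-8.
        String raw = new String(Files.readAllBytes(xmlFile), StandardCharsets.UTF_8);
        String clean = stripInvalidXmlChars(raw);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(clean)));
    }
}
```

Calling XmlSanitizer.parseSanitized(Paths.get("xmlfile.xml")) would then return a Document whose other elements can be read as usual. The cleanup only makes the file parseable: the PDF bytes in the "data" element are damaged by it and should not be used afterwards. If the binary data happens to contain the sequence ]]>, the CDATA section ends early and parsing can still fail.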



Answer 2:

Your XML document is invalid. Binary data such as a PDF should be Base64- or hex-encoded before it is embedded. I don't think there is a solution other than changing your document.

Regards
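
If the producer of the file can be changed, the Base64 suggestion in Answer 2 might look like the sketch below. The class BinaryValueCodec and its two methods are hypothetical names used only for illustration; the sketch assumes Java 8 or later, where java.util.Base64 is available.

```java
import java.util.Base64;

// Hypothetical helper, not part of any library.
final class BinaryValueCodec {

    /** Encodes raw bytes as Base64 text, which contains only characters XML allows. */
    static String toValueElement(byte[] data) {
        return "<value>" + Base64.getEncoder().encodeToString(data) + "</value>";
    }

    /** Decodes the text content of such a value element back into the original bytes. */
    static byte[] fromValueText(String text) {
        // The MIME decoder ignores line breaks and surrounding whitespace.
        return Base64.getMimeDecoder().decode(text);
    }
}
```

Base64 output uses only letters, digits, '+', '/' and '=', so it needs no CDATA section and cannot trigger the invalid character error shown in the question.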