Java: skip binary data in xml file while parsing

I want to parse an XML file in Java that contains binary data. Here is an example of the XML file:

<?xml version="1.0" encoding="utf-8"?>
<documents>
  <document>
    <element name="docid">
      <value><![CDATA[0902307e8004c74c]]></value>
    </element>
    <element name="published">
      <value><![CDATA[2012-01-01T00:00:00]]></value>
    </element>
    <element name="documenttype">
      <value><![CDATA[Circular]]></value>
    </element>
    <element name="data">
      <value><![CDATA[%PDF-1.6
%����
1020 0 obj
<</Filter/FlateDecode/First 20/Length 270/N 3/Type/ObjStm>>stream
�o^���)|�,�Ypoef�
l���o�>����u���b"Cb�|���%&��D�yD��q�q�q�q�q��%_ja�LJob��/��3"=����o���]V11}�    }a�+'6@����C�,^}�d%�۠�`s��q��5�׷^(�N��{S<S�����A��������-������f\ڌ��|U/݌�z���f�I9����g�g���s���0z'��X~
endstream
endobj
startxref
55097
%%EOF
]]></value>
    </element>
    <element name="dataname">
      <value><![CDATA[sdfsfsfsdsdfsd.pdf]]></value>
    </element>
  </document>
</documents>

Normally I would parse such an XML file this way:

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = null;
Document doc = null;
try {
    documentBuilder = documentBuilderFactory.newDocumentBuilder();
} catch (ParserConfigurationException e) {
    e.printStackTrace();
}
try {
    // fastXMLFile is the java.io.File pointing at the XML shown above
    doc = documentBuilder.parse(fastXMLFile);
} catch (SAXException e) {
    System.out.println("SAXExept");
    e.printStackTrace();
} catch (IOException e) {
    System.out.println("Test");
    return;
}

But because the "data" element contains binary data, the parser fails with:

[Fatal Error] xmlfile.xml:58:10: An invalid XML character (Unicode: 0x1a) was found in the CDATA section.
SAXExept
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x1a) was found in the CDATA section.

I don't need to parse this data field for now; I could just skip it. I just want to parse the rest of the data. Is this possible?

Tags: java xml parsing binary

2 Answers

等我变得足够好

#2 · 2020-05-07 05:21

Since your XML includes invalid characters (as the exception shows), you can't expect a library to parse it successfully. Since you can't change how the XML file is created, and since you can't see the code of the search engine, I believe the easiest option for you is to remove the invalid characters from the XML.

So the process would be:

1- Read the contents of the XML file into a String.

2- Go through the String and remove all invalid characters.

3- Write the String back to the file, or create a new file if you can't modify the original.

4- Parse the modified/new file.

In order to replace invalid characters, see the following link as it also includes a method to do so.

Invalid XML Characters: when valid UTF8 does not mean valid XML.
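
The four steps above can be sketched roughly as follows. This is only an illustration, not the method from the linked article: the class name XmlSanitizer and its method names are made up for this example, and it assumes the file fits in memory and is read as UTF-8. It keeps only the code points permitted by the XML 1.0 Char production. Note that the PDF payload is not preserved by this cleanup, and if the raw bytes happen to contain "]]>" the CDATA section ends early and parsing may still fail.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlSanitizer {

    /** Returns the input without code points outside the XML 1.0 Char range. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Steps 1 to 4: read, clean, write to a new file, parse the new file. */
    static Document sanitizeAndParse(Path source, Path cleaned) throws Exception {
        // Malformed UTF-8 byte sequences are decoded as U+FFFD, which XML allows.
        String raw = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        Files.write(cleaned, stripInvalidXmlChars(raw).getBytes(StandardCharsets.UTF_8));
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(cleaned.toFile());
    }

    public static void main(String[] args) throws Exception {
        Path source = Files.createTempFile("dirty", ".xml");
        Path cleaned = Files.createTempFile("clean", ".xml");
        // A tiny stand-in for the real file: CDATA containing the 0x1A character.
        String xml = "<documents><document><element name=\"data\"><value><![CDATA[%PDF"
                + (char) 0x1A + "]]></value></element></document></documents>";
        Files.write(source, xml.getBytes(StandardCharsets.UTF_8));

        Document doc = sanitizeAndParse(source, cleaned);
        System.out.println(doc.getDocumentElement().getTagName()); // prints "documents"
    }
}
```

Writing to a separate file leaves the original untouched, which matters if the binary content is needed later.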


Emotional °昔

#3 · 2020-05-07 05:31

Your XML document is invalid. The PDF data should be Base64- or hex-encoded. I don't think there is a solution other than changing the document.
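
For whoever controls the document creation, a minimal sketch of the Base64 approach could look like this. The class name Base64Payload and its methods are hypothetical; only java.util.Base64 from the JDK is used. The encoded text contains only letters, digits, '+', '/' and '=', so it is always safe inside an XML element.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64Payload {

    /** Encodes raw bytes (e.g. a PDF) as text that is safe to embed in XML. */
    static String encode(byte[] binary) {
        return Base64.getEncoder().encodeToString(binary);
    }

    /** Decodes the element text; the MIME decoder tolerates line breaks. */
    static byte[] decode(String text) {
        return Base64.getMimeDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Stand-in for PDF bytes, including values that are illegal in XML.
        byte[] pdfBytes = {0x25, 0x50, 0x44, 0x46, 0x1A, 0x00, (byte) 0xFF};

        String value = "<value>" + encode(pdfBytes) + "</value>";
        System.out.println(value);

        // Round trip: the reader gets the original bytes back unchanged.
        byte[] restored = decode(encode(pdfBytes));
        System.out.println(Arrays.equals(pdfBytes, restored)); // prints "true"
    }
}
```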

Regards
