Read an escaped quote as an escaped quote from XML

Posted 2019-04-15 20:14

Question:

I load an XML file into a DOM model and analyze it.

The code for that is:

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MyTest {
    public static void main(String[] args) {
        Document doc = XMLUtils.fileToDom("MyTest.xml"); // Loads XML data into a DOM
        Element rootElement = doc.getDocumentElement();
        NodeList nodes = rootElement.getChildNodes();
        // Items 0 and 2 are whitespace text nodes, so the two <test>
        // elements sit at indices 1 and 3.
        Node child1 = nodes.item(1);
        Node child2 = nodes.item(3);
        String str1 = child1.getTextContent();
        String str2 = child2.getTextContent();
        if (str1 != null) {
            System.out.println(str1.equals(str2));
        }
        System.out.println();
        System.out.println(str1);
        System.out.println(str2);
    }
}

MyTest.xml

<tests>
   <test name="1">ff1 &quot;</test>
   <test name="2">ff1 "</test>
</tests>

Result:

true

ff1 "
ff1 "

Desired result:

false

ff1 &quot;
ff1 "

So I need to distinguish these two cases: when the quote is escaped and when it is not.

Please help.

Thank you in advance.

P.S. The code for XMLUtils#fileToDom(String filePath), a snippet from the XMLUtils class:

static {
    DocumentBuilderFactory dFactory = DocumentBuilderFactory.newInstance();
    dFactory.setNamespaceAware(false);
    dFactory.setValidating(false);
    try {
        docNonValidatingBuilder = dFactory.newDocumentBuilder();
    } catch (ParserConfigurationException e) {
        // Fail fast instead of silently leaving the builder null.
        throw new ExceptionInInitializerError(e);
    }
}

public static DocumentBuilder getNonValidatingBuilder() {
    return docNonValidatingBuilder;
}

public static Document fileToDom(String filePath) {

    Document doc = getNonValidatingBuilder().newDocument();
    File f = new File(filePath);
    if(!f.exists())
        return doc;

    try {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult(doc);
        StreamSource source = new StreamSource(f);
        transformer.transform(source, result);
    } catch (Exception e) {
        return doc;
    }

    return doc;

}

Answer 1:

I've taken a look at the source code of Apache Xerces and can propose a solution (but it is a monkey patch). I wrote a simple class:

package a;
import java.io.IOException;
import org.apache.xerces.impl.XMLDocumentScannerImpl;
import org.apache.xerces.parsers.NonValidatingConfiguration;
import org.apache.xerces.xni.XMLString;
import org.apache.xerces.xni.XNIException;
import org.apache.xerces.xni.parser.XMLComponent;

public class MyConfig extends NonValidatingConfiguration {

    private MyScanner myScanner;

    @Override
    @SuppressWarnings("unchecked")
    protected void configurePipeline() {
        if (myScanner == null) {
            myScanner = new MyScanner();
            addComponent((XMLComponent) myScanner);
        }
        super.fProperties.put(DOCUMENT_SCANNER, myScanner);
        super.fScanner = myScanner;
        super.fScanner.setDocumentHandler(this.fDocumentHandler);
        super.fLastComponent = fScanner;
    }

    private static class MyScanner extends XMLDocumentScannerImpl {

        @Override
        protected void scanEntityReference() throws IOException, XNIException {
            // name
            String name = super.fEntityScanner.scanName();
            if (name == null) {
                reportFatalError("NameRequiredInReference", null);
                return;
            }

            super.fDocumentHandler.characters(new XMLString(("&" + name + ";")
                .toCharArray(), 0, name.length() + 2), null);

            // end
            if (!super.fEntityScanner.skipChar(';')) {
                reportFatalError("SemicolonRequiredInReference",
                        new Object[] { name });
            }
            fMarkupDepth--;
        }
    }

}

You only need to add the following line to your main method before parsing starts:

System.setProperty(
            "org.apache.xerces.xni.parser.XMLParserConfiguration",
            "a.MyConfig");

You will then get the expected result:

false

ff1 &quot;
ff1 "


Answer 2:

It looks like you can get the TEXT_NODE child and use getNodeValue (assuming it's not null):

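public static String getRawContent(Node n) {
  if (n == null) {
      return null;
  }

  Node n1 = getChild(n, Node.TEXT_NODE);

  if (n1 == null) {
      return null;
  }

  return n1.getNodeValue();
}

// Helper assumed by the snippet above (not shown in the original answer):
// returns the first child of the given node type, or null if there is none.
private static Node getChild(Node parent, short type) {
  for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (c.getNodeType() == type) {
          return c;
      }
  }
  return null;
}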

Grabbed that from: http://www.java2s.com/Code/Java/XML/Gettherawtextcontentofanodeornullifthereisnotext.htm



Answer 3:

There is no way to do this for internal entities; XML does not support the concept. An internal entity reference is just another way of writing the same PSVI content in the text, so the two forms are not distinguishable.
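
To illustrate the point, here is a minimal sketch using only the JDK's built-in DOM parser. It parses the question's XML from a string and shows that both elements yield the same text. It then shows one possible workaround that acts on the raw characters before parsing, which is the only place the difference exists. The names EntityDemo, testTexts and protectQuot are made up for this example, and the string replacement is an assumption of mine, not something from the answers above: it is naive and would also rewrite &quot; inside attribute values and CDATA sections.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class EntityDemo {

    // Parses an XML string and returns the text content of every <test> element.
    static String[] testTexts(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList tests = doc.getElementsByTagName("test");
        String[] texts = new String[tests.getLength()];
        for (int i = 0; i < texts.length; i++) {
            texts[i] = tests.item(i).getTextContent();
        }
        return texts;
    }

    // Hypothetical workaround: escape the ampersand of every &quot; reference
    // in the raw source, so the parser delivers the literal characters "&quot;".
    // Naive on purpose: it does not skip attribute values or CDATA sections.
    static String protectQuot(String rawXml) {
        return rawXml.replace("&quot;", "&amp;quot;");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tests>"
                + "<test name=\"1\">ff1 &quot;</test>"
                + "<test name=\"2\">ff1 \"</test>"
                + "</tests>";

        // Plain parse: the reference is resolved before the DOM is built.
        String[] plain = testTexts(xml);
        System.out.println(plain[0].equals(plain[1])); // prints "true"

        // Pre-processed parse: the reference survives as literal text.
        String[] kept = testTexts(protectQuot(xml));
        System.out.println(kept[0].equals(kept[1]));   // prints "false"
        System.out.println(kept[0]);
        System.out.println(kept[1]);
    }
}
```

The trade-off is that the DOM built from the rewritten text no longer matches the original document exactly, so this only suits cases where you control how the text content is used afterwards.<test name=\"2\">ff1 \"</test></tests>";
String[] plain = EntityDemo.testTexts(xml);
if (plain.length != 2) throw new AssertionError("expected 2 elements, got " + plain.length);
if (!plain[0].equals(plain[1])) throw new AssertionError("entity should be resolved by the parser");
if (!plain[0].equals("ff1 \"")) throw new AssertionError(plain[0]);
String[] kept = EntityDemo.testTexts(EntityDemo.protectQuot(xml));
if (!kept[0].equals("ff1 &quot;")) throw new AssertionError(kept[0]);
if (!kept[1].equals("ff1 \"")) throw new AssertionError(kept[1]);
</test>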