How to remove extra empty lines from XML file?

Posted 2019-01-22 09:26

In short: I have many empty lines generated in an XML file, and I am looking for a way to remove them in order to clean up the file. How can I do that?

In detail: I currently have this XML file:

<recent>
  <paths>
    <path>path1</path>
    <path>path2</path>
    <path>path3</path>
    <path>path4</path>
  </paths>
</recent>

And I use this Java code to delete all the <path> tags and add new ones instead:

public void savePaths( String recentFilePath ) {
    ArrayList<String> newPaths = getNewRecentPaths();
    Document recentDomObject = getXMLFile( recentFilePath );  // Load the XML file as a DOM Document.
    NodeList pathNodes = recentDomObject.getElementsByTagName( "path" );   // Get all <path> nodes.

    //1. Remove all old path nodes :
        for ( int i = pathNodes.getLength() - 1; i >= 0; i-- ) { 
            Element pathNode = (Element)pathNodes.item( i );
            pathNode.getParentNode().removeChild( pathNode );
        }

    //2. Save all new paths :
        Element pathsElement = (Element)recentDomObject.getElementsByTagName( "paths" ).item( 0 );   // Get the first <paths> node.

        for( String newPath: newPaths ) {
            Element newPathElement = recentDomObject.createElement( "path" );
            newPathElement.setTextContent( newPath );
            pathsElement.appendChild( newPathElement );
        }

    //3. Save the XML changes :
        saveXMLFile( recentFilePath, recentDomObject ); 
}

After executing this method a number of times, I get an XML file with the right results, but with many empty lines after the "paths" tag and before the first "path" tag, like this:

<recent>
  <paths>





    <path>path5</path>
    <path>path6</path>
    <path>path7</path>
  </paths>
</recent>

Does anyone know how to fix that?

------------------------------------------- Edit: added the getXMLFile(...) and saveXMLFile(...) code.

public Document getXMLFile( String filePath ) { 
    File xmlFile = new File( filePath );

    try {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document domObject = db.parse( xmlFile );
        domObject.getDocumentElement().normalize();

        return domObject;
    } catch (Exception e) {
        e.printStackTrace();
    }

    return null;
}

public void saveXMLFile( String filePath, Document domObject ) {
    File xmlOutputFile = null;
    FileOutputStream fos = null;

    try {
        xmlOutputFile = new File( filePath );
        fos = new FileOutputStream( xmlOutputFile );
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
        transformer.setOutputProperty( "{http://xml.apache.org/xslt}indent-amount", "2" );
        DOMSource xmlSource = new DOMSource( domObject );
        StreamResult xmlResult = new StreamResult( fos );
        transformer.transform( xmlSource, xmlResult );  // Save the XML file.
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (TransformerConfigurationException e) {
        e.printStackTrace();
    } catch (TransformerException e) {
        e.printStackTrace();
    } finally {
        if (fos != null)
            try {
                fos.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
    }
}

8 answers
ら.Afraid
#2 · 2019-01-22 10:05

I faced the same problem and had no idea what caused it for a long time, but after reading Brad's question and his own answer to it, I figured out where the trouble is.

I have to add my own answer because Brad's isn't really ideal. As Isaac said:

I wouldn't be a huge fan of blindly removing child nodes without knowing what they are

So a better "solution" (quoted because it is really a workaround) is:

pathsElement.setTextContent("");

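This completely removes the useless blank lines, because setting the text content to an empty string discards every child of the element, including the whitespace-only text nodes. It is definitely tidier than removing the child nodes one by one. Brad, this should work for you too.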
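
As a rough illustration of how this workaround could replace step 1 of the question's savePaths(), here is a minimal sketch. The class name, the method name, and the in-memory XML string are assumptions made for the example; only the setTextContent("") call comes from this answer.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original code.
class ClearPathsDemo {

    // Replaces everything under <paths> with fresh <path> elements and
    // returns how many child nodes <paths> has afterwards.
    static int replacePaths(String xml, List<String> newPaths) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element paths = (Element) doc.getElementsByTagName("paths").item(0);

            // Drops all children at once: the old <path> elements and the
            // whitespace-only text nodes that sat between them.
            paths.setTextContent("");

            for (String newPath : newPaths) {
                Element path = doc.createElement("path");
                path.setTextContent(newPath);
                paths.appendChild(path);
            }
            return paths.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<recent>\n  <paths>\n    <path>path1</path>\n"
                + "    <path>path2</path>\n  </paths>\n</recent>";
        // Only the three new <path> elements remain under <paths>.
        System.out.println(replacePaths(xml, Arrays.asList("path5", "path6", "path7")));
    }
}
```

Because no stale whitespace survives under <paths>, the indenting Transformer in saveXMLFile() has nothing left over to copy into the output.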

But this only treats the effect, not the cause.

The cause is this: removeChild() removes the child, but it leaves behind the removed child's indentation and line break. That leftover indentation and line break is treated as text content.

So, to remove the cause, we need to figure out how to remove a child together with its indentation. Welcome to my question about this.
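
One possible way to do that, offered as a sketch rather than a definitive fix: before removing an element, check whether its previous sibling is a whitespace-only text node, and remove that as well. The class and method names below are made up for illustration, and the XML is parsed from a string instead of the file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original code.
class RemoveWithIndentDemo {

    // Removes the node and, if present, the whitespace-only text node
    // directly before it (the node's line break and indentation).
    static void removeWithIndent(Node node) {
        Node parent = node.getParentNode();
        Node previous = node.getPreviousSibling();
        if (previous != null
                && previous.getNodeType() == Node.TEXT_NODE
                && previous.getTextContent().trim().isEmpty()) {
            parent.removeChild(previous);
        }
        parent.removeChild(node);
    }

    // Removes every <path> element and returns how many child nodes
    // are left under <paths>.
    static int childrenLeftAfterRemoval(String xml, boolean withIndent) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList pathNodes = doc.getElementsByTagName("path");
            // Iterate backwards because the NodeList is live.
            for (int i = pathNodes.getLength() - 1; i >= 0; i--) {
                Node path = pathNodes.item(i);
                if (withIndent) {
                    removeWithIndent(path);
                } else {
                    path.getParentNode().removeChild(path);
                }
            }
            return doc.getElementsByTagName("paths").item(0).getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<recent>\n  <paths>\n    <path>path1</path>\n"
                + "    <path>path2</path>\n  </paths>\n</recent>";
        System.out.println(childrenLeftAfterRemoval(xml, false)); // 3: the indentation text nodes stay behind
        System.out.println(childrenLeftAfterRemoval(xml, true));  // 1: only the text before </paths> remains
    }
}
```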

戒情不戒烟
#3 · 2019-01-22 10:07

First, an explanation of why this happens, which might be a bit off since you didn't include the code that loads the XML file into a DOM object.

When you read an XML document from a file, the whitespace between tags actually constitutes valid DOM nodes, according to the DOM specification. Therefore, the XML parser treats each such run of whitespace as a DOM node (of type TEXT).
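
To make this concrete, the following small sketch parses a two-entry version of the question's XML with a default (non-validating) parser and lists the children of <paths>. The class and method names are hypothetical, and the XML is read from a string purely to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original code.
class WhitespaceNodesDemo {

    // Lists the children of <paths>: "TEXT" for text nodes,
    // the node name for everything else.
    static String describePathsChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getElementsByTagName("paths").item(0).getChildNodes();
            StringBuilder description = new StringBuilder();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (description.length() > 0) {
                    description.append(' ');
                }
                description.append(
                        child.getNodeType() == Node.TEXT_NODE ? "TEXT" : child.getNodeName());
            }
            return description.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<recent>\n  <paths>\n    <path>path1</path>\n"
                + "    <path>path2</path>\n  </paths>\n</recent>";
        // prints: TEXT path TEXT path TEXT
        System.out.println(describePathsChildren(xml));
    }
}
```

Each TEXT entry is a line break plus indentation. Removing only the <path> elements leaves those text nodes in place, which is where the blank lines in the saved file come from.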

To get rid of it, there are three approaches I can think of:

  • Associate the XML with a schema, and then use setValidating(true) along with setIgnoringElementContentWhitespace(true) on the DocumentBuilderFactory.

    (Note: setIgnoringElementContentWhitespace will only work if the parser is in validating mode, which is why you must use setValidating(true))

  • Write an XSL to process all nodes, filtering out whitespace-only TEXT nodes.
  • Use Java code to do this: use XPath to find all whitespace-only TEXT nodes, iterate through them and remove each one from its parent (using getParentNode().removeChild()). Something like this would do (doc would be your DOM document object):

    XPath xp = XPathFactory.newInstance().newXPath();
    NodeList nl = (NodeList) xp.evaluate("//text()[normalize-space(.)='']", doc, XPathConstants.NODESET);
    
    for (int i=0; i < nl.getLength(); ++i) {
        Node node = nl.item(i);
        node.getParentNode().removeChild(node);
    }
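
The first two approaches can be sketched roughly as follows. This is an illustration under assumptions of mine, not code from the question: it uses an inline DTD as the "schema" (in JAXP, setValidating(true) controls DTD validation), the stylesheet is a plain identity transform plus an empty template for whitespace-only text nodes, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original code.
class StripWhitespaceDemo {

    // Approach 1: validating parse that drops ignorable whitespace.
    // Returns the number of child nodes under <paths>.
    static int countPathsChildrenValidating(String xmlWithDtd) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            dbf.setIgnoringElementContentWhitespace(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // fail loudly if the document does not match its DTD
                }
            });
            Document doc = db.parse(new InputSource(new StringReader(xmlWithDtd)));
            return doc.getElementsByTagName("paths").item(0).getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Approach 2: identity transform that skips whitespace-only text nodes.
    static final String STRIP_XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            + "<xsl:template match='@*|node()'>"
            + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
            + "</xsl:template>"
            + "<xsl:template match=\"text()[normalize-space(.)='']\"/>"
            + "</xsl:stylesheet>";

    static String stripWithXslt(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STRIP_XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String body = "<recent>\n  <paths>\n    <path>path1</path>\n"
                + "    <path>path2</path>\n  </paths>\n</recent>";
        String dtd = "<!DOCTYPE recent [<!ELEMENT recent (paths)>"
                + "<!ELEMENT paths (path*)><!ELEMENT path (#PCDATA)>]>\n";
        System.out.println(countPathsChildrenValidating(dtd + body));
        System.out.println(stripWithXslt(body));
    }
}
```

The validating route only helps when the file really carries (or can be given) a DTD. The stylesheet route needs no change to the file and could be applied at save time, at the cost of maintaining a small XSLT.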
    