Replace an XML element's value? Sed regular ex

Posted 2020-07-14 05:48

Question:

I want to take an XML file and replace an element's value. For example if my XML file looks like this:

<abc>
    <xyz>original</xyz>
</abc>

I want to replace the xyz element's original value, whatever it may be, with another string so that the resulting file looks like this:

<abc>
    <xyz>replacement</xyz>
</abc>

How would you do this? I know I could write a Java program to do this but I assume that that's overkill for replacing a single element's value and that this could be easily done using sed to do a substitution using a regular expression. However I'm less than novice with that command and I'm hoping some kind soul reading this will be able to spoon feed me the correct regular expression for the job.

One idea is to do something like this:

sed 's|<xyz>.*</xyz>|<xyz>replacement</xyz>|' <original.xml >new.xml

Maybe it's better for me to just replace the entire line of the file with what I want it to be, since I will know the element name and the new value I want to use? But this assumes that the element in question is on a single line and that no other XML data is on the same line. I'd rather have a command which will basically replace element xyz's value with a new string that I specify and not have to worry if the element is all on one line or not, etc.

If sed is not the best tool for this job then please dial me in to a better approach.

If anyone can steer me in the right direction I'll really appreciate it, you'll probably save me hours of trial and error. Thanks in advance!

--James

Answer 1:

sed is not going to be an easy tool to use for multi-line replacements. It's possible to implement them using its N command and a loop that checks, after reading in each line, whether the close of the tag has been found... but it's not pretty and you'll never remember it.
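For reference, the N-based loop described above can be sketched roughly as follows. This is only an illustration, assuming GNU sed and an opening <xyz> tag that appears at most once per line; the sample input is made up for the example.

```shell
# Hypothetical sample: the <xyz> value spans two lines.
input='<abc>
    <xyz>first line
second line</xyz>
</abc>'

# When a line contains <xyz>, keep appending the next line (N) until </xyz>
# is in the pattern space, then replace everything between the two tags.
result=$(printf '%s\n' "$input" | sed -e '/<xyz>/{' \
    -e ':a' \
    -e '/<\/xyz>/!{' \
    -e 'N' \
    -e 'ba' \
    -e '}' \
    -e 's|<xyz>.*</xyz>|<xyz>replacement</xyz>|' \
    -e '}')

printf '%s\n' "$result"
```

Inside sed's pattern space "." also matches the embedded newlines, which is why the single substitution can cover the joined lines.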

Of course, actually parsing the XML and replacing tags is going to be the safest thing, but if you know you won't run into any problems, you could try this:

perl -p -0777 -e 's@<xyz>.*?</xyz>@<xyz>new-value</xyz>@sg' <xml-file>

Breaking this down:

  • -p tells it to loop through the input and print
  • -0777 tells it to treat the whole file as a single record, so that it reads the whole thing in one slurp
  • -e means here comes the stuff I want you to do

And the substitution itself:

  • use @ as a delimiter so you don't have to escape /
  • use *?, the non-greedy version, to match as little as possible, so we don't go all the way to the last occurrence of </xyz> in the file
  • use the s modifier to let . match newlines (to get the multiple-line tag values)
  • use the g modifier to match the pattern multiple times

Tada! This prints the result to stdout - once you verify it does what you want, add the -i option to tell it to edit the file in place.



Answer 2:

OK so I bit the bullet and took the time to write a Java program which does what I want. Below is the operative method called by my main() method which does the work, in case this will be helpful to someone else in the future:

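// imports needed by the enclosing class
import java.io.File;
import java.io.FileInputStream;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/**
 * Takes an input XML file, replaces the text value of the node specified by an XPath parameter, and writes a new
 * XML file with the updated data.
 * 
 * @param inputXmlFilePathName path of the XML file to read
 * @param outputXmlFilePathName path of the XML file to write
 * @param elementXpathExpression XPath expression selecting the node(s) to update
 * @param elementValue new text value for the selected node(s)
 * @param replaceAllFoundElements whether to update every match when the XPath selects more than one node
 */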
public static void replaceElementValue(final String inputXmlFilePathName,
                                       final String outputXmlFilePathName,
                                       final String elementXpathExpression,
                                       final String elementValue,
                                       final boolean replaceAllFoundElements)
{
    try
    {
        // get the template XML as a W3C Document Object Model which we can later write back as a file
        InputSource inputSource = new InputSource(new FileInputStream(inputXmlFilePathName));
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        Document document = documentBuilderFactory.newDocumentBuilder().parse(inputSource);

        // create an XPath expression to access the element's node
        XPathFactory xpathFactory = XPathFactory.newInstance();
        XPath xpath = xpathFactory.newXPath();
        XPathExpression xpathExpression = xpath.compile(elementXpathExpression);

        // get the node(s) which correspond to the XPath expression and replace the value
        // (a NODESET evaluation yields an empty NodeList, not null, when nothing matches)
        NodeList nodeList = (NodeList) xpathExpression.evaluate(document, XPathConstants.NODESET);
        if (nodeList.getLength() == 0)
        {
            throw new RuntimeException("Failed to find a node corresponding to the provided XPath.");
        }
        if ((nodeList.getLength() > 1) && !replaceAllFoundElements)
        {
            throw new RuntimeException("Found multiple nodes corresponding to the provided XPath and multiple replacements not specified.");
        }
        for (int i = 0; i < nodeList.getLength(); i++)
        {
            nodeList.item(i).setTextContent(elementValue);
        }

        // prepare the DOM document for writing
        Source source = new DOMSource(document);

        // prepare the output file
        File file = new File(outputXmlFilePathName);
        Result result = new StreamResult(file);

        // write the DOM document to the file
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(source, result);
    }
    catch (Exception ex)
    {
        throw new RuntimeException("Failed to replace the element value.", ex);
    }
}

I run the program like so:

$ java -cp xmlutility.jar com.abc.util.XmlUtility input.xml output.xml '//name/text()' JAMES


Answer 3:

I hate to be a naysayer, but XML is anything but regular. A regular expression will probably be more trouble than it is worth. See here for more insight: Using C# Regular expression to replace XML element content

Your thought of a simple Java program might be nice after all. An XSLT transform may be easier if you know XSLT pretty well. If you know Perl ... that's the way to go IMHO.

Having said that, if you choose to go with a regex, keep in mind that sed works on one line at a time. The g flag at the end of the substitution does not make the match span lines; it only replaces every match on a line instead of just the first. To match a pattern that crosses line breaks, the lines first have to be joined in the pattern space, for example with the N command.

Also, the regex you proposed is "greedy". It will grab the biggest group of characters it can, because ".*" will match from the first occurrence of <xyz> to the last </xyz>. In engines that support it (Perl, for example), you can make it "lazy" by changing the wildcard to ".*?". Putting the question mark after the asterisk tells it to match only one set of <xyz> to </xyz>. sed's POSIX regular expressions have no lazy quantifier.
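
For sed itself, a common way to get the same "stop at the first closing tag" effect is a negated character class instead of a lazy quantifier. The sketch below assumes the element's value contains no nested markup (no "<" inside it); the input line is made up for illustration.

```shell
# Hypothetical single-line input holding two <xyz> elements.
line='<abc><xyz>one</xyz><xyz>two</xyz></abc>'

# Greedy .* swallows everything from the first <xyz> to the last </xyz>.
greedy=$(printf '%s\n' "$line" | sed 's|<xyz>.*</xyz>|<xyz>replacement</xyz>|')

# [^<]* cannot cross a "<", so each match stops at its own closing tag;
# the g flag then replaces every match on the line, not just the first.
bounded=$(printf '%s\n' "$line" | sed 's|<xyz>[^<]*</xyz>|<xyz>replacement</xyz>|g')

printf '%s\n%s\n' "$greedy" "$bounded"
```

The first result keeps a single <xyz> element, while the second keeps both and rewrites each value.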



Answer 4:

I was trying to do the same thing and came across this [gu]awk script that achieves it.

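BEGIN { FS = "[<>]" }    # split each line on angle brackets
{
    # $2 is the first tag name on the line
    if ($2 == "xyz") {
        # match the whole element instead of using $3 as a regex, so values with
        # regex metacharacters, or text that also occurs in the tag name, are safe
        sub(/<xyz>[^<]*<\/xyz>/, "<xyz>replacement</xyz>")
    }
    print
}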


Tags: xml regex sed