How to parse XML in Bash?

Posted 2018-12-31 07:18

Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -E -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt

15 answers
公子世无双
#2 · 2018-12-31 07:19

While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.

Here is a Python script that uses lxml for parsing. It takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.

Example 1

#!/usr/bin/env python
import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]

# A hack that makes the default namespace (if defined)
# accessible via the 'p:' prefix.
# E.g. given a default namespace such as 'xmlns="http://maven.apache.org/POM/4.0.0"',
# an XPath of '//p:module' will return all the 'module' nodes.
ns = tree.getroot().nsmap
if None in ns:
    ns['p'] = ns.pop(None)
# End of hack

for e in tree.xpath(xpath_expression, namespaces=ns):
    if isinstance(e, str):
        print(e)
    else:
        # encoding="unicode" makes tostring() return str instead of bytes
        print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True, encoding="unicode"))

lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python3-lxml (python-lxml for Python 2).

Usage

python xpath.py myfile.xml "//mynode"

lxml also accepts a URL as input:

python xpath.py http://www.feedforall.com/sample.xml "//link"

Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...), you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. If the p prefix is already mapped in your XML, you'll need to modify the script to use another prefix.


Example 2

A one-off script that serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace in braces, {http://maven.apache.org/POM/4.0.0}:

pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modules>
        <module>cherries</module>
        <module>bananas</module>
        <module>pears</module>
    </modules>
</project>

module_extractor.py:

from lxml import etree
# Pass the file name directly so lxml opens the file itself in binary mode
for _, e in etree.iterparse("pom.xml", tag="{http://maven.apache.org/POM/4.0.0}module"):
    print(e.text)
皆成旧梦
#3 · 2018-12-31 07:20

Yuzem's method can be improved by swapping the < and > signs in the rdom function and reversing the order of the variable assignments, so that:

rdom () { local IFS=\> ; read -d \< E C ;}

becomes:

rdom () { local IFS=\< ; read -d \> C E ;}

If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
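
To make the difference visible, here is a minimal sketch that runs both variants over the same input and records which tags the loop body actually sees. The function names rdom_original and rdom_reversed, the sample document, and the added -r flag (so backslashes are kept literally) are my own choices for illustration, not part of Yuzem's answer.

```shell
#!/usr/bin/env bash
# Both variants set the globals E (tag) and C (text content).
# -r is added here so that backslashes in the input are kept literally.
rdom_original () { local IFS=\> ; read -r -d \< E C ;}
rdom_reversed () { local IFS=\< ; read -r -d \> C E ;}

# Hypothetical sample document, used only for this illustration.
xml='<root><title>Hello</title></root>'

# Original: read stops at each '<'. The final '</root>' is followed by no
# further '<', so read hits end of input, returns non-zero, and the loop
# ends before the body can process that tag.
seen_original=()
while rdom_original; do
    # The very first read returns an empty tag (nothing precedes the first '<').
    [[ -n $E ]] && seen_original+=("$E")
done <<< "$xml"

# Reversed: read stops at each '>', so every tag, including the last one,
# is delivered to the loop body. C holds the text that preceded the tag.
seen_reversed=()
while rdom_reversed; do
    seen_reversed+=("$E")
done <<< "$xml"

printf 'original: %s\n' "${seen_original[*]}"
printf 'reversed: %s\n' "${seen_reversed[*]}"
```

This prints "original: root title /title" and "reversed: root title /title /root". Note that with the reversed version, C contains the text that came before the current tag, so an element's content is available when its closing tag is read.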

春风洒进眼中
#4 · 2018-12-31 07:24

Command-line tools that can be called from shell scripts include:

  • 4xpath - command-line wrapper around Python's 4Suite package
  • XMLStarlet
  • xpath - command-line wrapper around Perl's XPath library
  • Xidel - Works with URLs as well as files. Also works with JSON

I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
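
As a rough sketch of the xmllint route applied to the question's goal (extracting a page title), assuming an xmllint build that supports the --xpath option. The sample file and its contents are hypothetical. XHTML elements live in the XHTML namespace, so the expression matches on local-name() rather than on a plain /html/head/title path.

```shell
#!/usr/bin/env bash
# Hypothetical sample file, created only so the example is self-contained.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample page</title></head>
  <body><p>Hello</p></body>
</html>
EOF

if command -v xmllint >/dev/null 2>&1; then
    # string(...) yields the text of the first matching node, without the tags.
    title=$(xmllint --xpath 'string(//*[local-name()="title"])' "$tmp")
    printf '%s\n' "$title"
else
    echo "xmllint not found; install the libxml2 command-line tools first" >&2
fi
rm -f "$tmp"
```

Using string() avoids the sed step from the question, because the result is already plain text.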

笑指拈花
#5 · 2018-12-31 07:25

This is sufficient...

xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
梦该遗忘
#6 · 2018-12-31 07:30

You can use the xpath utility. It's installed with the Perl XML-XPath package.

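Usage:

/usr/bin/xpath [filename] query

Newer versions of the tool take the query through the -e option instead:

/usr/bin/xpath -e query [filename]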

Alternatively, use XMLStarlet. To install it on openSUSE:

sudo zypper install xmlstarlet

or try cnf xml on other platforms.

宁负流年不负卿
#7 · 2018-12-31 07:34

Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
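
A small sketch of how that line-oriented output might be used from a shell script. It assumes xml2 is installed, reads XML on standard input, and writes lines of the form /path/to/element=text; that is my understanding of its format rather than something stated above. The sample document is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical one-line sample document, used only for this illustration.
xml='<project><modules><module>cherries</module><module>bananas</module></modules></project>'

if command -v xml2 >/dev/null 2>&1; then
    # Assumed output format: one "path=value" line per text node, which makes
    # the result easy to filter with grep and cut.
    modules=$(printf '%s\n' "$xml" | xml2 | grep '^/project/modules/module=' | cut -d= -f2-)
    printf '%s\n' "$modules"
else
    echo "xml2 not found; install it first" >&2
fi
```

Because every value sits on its own line together with its full path, the usual line-based tools (grep, cut, sed, awk) are enough for simple extractions.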
