Ideally, what I would like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.
Here is a Python script which uses lxml for parsing. It takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.
Example 1
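
A minimal sketch of a script along these lines, assuming Python 3. It deliberately uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages, which means only ElementTree's limited XPath subset is available (relative paths, no absolute paths or XPath functions). The name find_matches and the script name in the comment are illustrative, not part of any library.

```python
#!/usr/bin/env python3
"""Print the text of every element matching a path expression."""
import sys
import xml.etree.ElementTree as ET


def find_matches(root, path, namespaces=None):
    """Return one string per element under `root` that matches `path`."""
    results = []
    for element in root.findall(path, namespaces):
        text = (element.text or "").strip()
        # Elements without text of their own are shown as serialized XML.
        results.append(text or ET.tostring(element, encoding="unicode").strip())
    return results


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical invocation: python3 xpath_sketch.py FILE PATH
    document_root = ET.parse(sys.argv[1]).getroot()
    for match in find_matches(document_root, sys.argv[2]):
        print(match)
```

Paths are evaluated relative to the root element. For an XHTML file, whose elements live in the http://www.w3.org/1999/xhtml namespace, the title would be addressed as {http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title rather than /html/head/title.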
lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python-lxml (or python3-lxml for Python 3).
Usage
lxml also accepts a URL as input:
Example 2
A one-off script which serves the narrow purpose of extracting module names from an Apache Maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:
pom.xml:
module_extractor.py:
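
A rough sketch of what such an extractor could look like, assuming Python 3 and the standard library's xml.etree.ElementTree in place of lxml. The sample POM content and the function name extract_modules are made up for illustration. Only the namespace URI comes from the text above.

```python
import io
import xml.etree.ElementTree as ET

# Default namespace of Maven POM files, in ElementTree's {uri}tag notation.
POM_NS = "{http://maven.apache.org/POM/4.0.0}"

# Hypothetical sample input; a real pom.xml has many more elements.
SAMPLE_POM = """<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <modules>
    <module>core</module>
    <module>web</module>
  </modules>
</project>
"""


def extract_modules(source):
    """Return the text of every <module> element in a file name or file object."""
    names = []
    # iterparse yields each element once its closing tag has been read.
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == POM_NS + "module":
            names.append((element.text or "").strip())
    return names


if __name__ == "__main__":
    for name in extract_modules(io.BytesIO(SAMPLE_POM.encode("utf-8"))):
        print(name)
```

Comparing against the bare name module would match nothing here, because the parser reports every tag with its namespace URI in braces.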
Yuzem's method can be improved by reversing the order of the < and > signs in the rdom function and the variable assignments, so that:
becomes:
If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
Command-line tools that can be called from shell scripts include:
I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
This is sufficient...
You can use the xpath utility. It's installed with the Perl XML-XPath package.
Usage:
or XMLStarlet. To install it on openSUSE use:
or try cnf xml on other platforms.
Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
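
To give an idea of what a line-oriented view of XML looks like, here is a small Python sketch that flattens a document into path=value lines. It is only an approximation in the spirit of xml2, and the real tool's output format may differ in its details. The function name flatten is illustrative.

```python
import xml.etree.ElementTree as ET


def flatten(element, prefix=""):
    """Return one 'path=value' line per attribute and text node under element."""
    path = f"{prefix}/{element.tag}"
    # Attributes are listed first, marked with an @ before the name.
    lines = [f"{path}/@{name}={value}" for name, value in element.attrib.items()]
    text = (element.text or "").strip()
    if text:
        lines.append(f"{path}={text}")
    for child in element:
        lines.extend(flatten(child, path))
    if not lines:
        # Keep empty elements visible as a bare path.
        lines.append(path)
    return lines


if __name__ == "__main__":
    sample = "<html><head><title>Hi</title></head><body class='x'><p>One</p></body></html>"
    print("\n".join(flatten(ET.fromstring(sample))))
```

Once every value sits on its own line, ordinary tools such as grep, sed and cut can pick out a single path, for example the line starting with /html/head/title=.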