Ideally, what I would like to be able to do is:
cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
This is really just an explanation of Yuzem's answer, but I didn't feel this much editing should be done to someone else's answer, and comments don't allow formatting, so...
Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:
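A minimal sketch of that function, matching the line-by-line walkthrough below:

```bash
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}
```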
Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of automatically being split on space, tab or newlines, it gets split on '>'. The next line says to read input from stdin, and instead of stopping at a newline, stop when you see a '<' character (the -d, for delimiter, flag). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:
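For instance, a tiny document like this (the walkthrough below traces exactly these calls):

```xml
<tag>value</tag>
```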
The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. read then assigns the variables like: ENTITY=/tag and CONTENT= (empty). The fourth call will return a non-zero status because we've reached the end of the file.

Now his while loop, cleaned up a bit to match the above:
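A sketch of that loop, matching the line-by-line description that follows:

```bash
while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo $CONTENT
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
```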
The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).

Now, given something like the following for input.xml (similar to what you get from listing a bucket on S3):
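Here is a minimal hypothetical stand-in (element names mirror S3's ListBucketResult format, kept flat so the trace below stays readable):

```xml
<ListBucketResult>
<Name>my-bucket</Name>
<Contents>
<Key>photos/cat.png</Key>
</Contents>
<Contents>
<Key>photos/dog.png</Key>
</Contents>
</ListBucketResult>
```

and the following loop, which just prints every entity and its content:

```bash
while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml
```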
You should get:
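For the hypothetical stand-in above, the output is as follows (the blank lines are the newlines that land in CONTENT, and the final /ListBucketResult never prints because read returns non-zero at end of file):

```
 => 
ListBucketResult => 

Name => my-bucket
/Name => 

Contents => 

Key => photos/cat.png
/Key => 

/Contents => 

Contents => 

Key => photos/dog.png
/Key => 

/Contents => 

```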
So if we wrote a while loop like Yuzem's, echoing CONTENT only for the Key entities (see the sketch below), we'd get a listing of all the files in the S3 bucket.
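A sketch against the hypothetical input.xml above:

```bash
while read_dom; do
    if [[ $ENTITY = "Key" ]]; then
        echo $CONTENT
    fi
done < input.xml
```

which prints:

```
photos/cat.png
photos/dog.png
```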
EDIT: If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function, like the sketch below; otherwise, any line splitting you do later in the script will be messed up.
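A sketch of the save-and-restore variant:

```bash
read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}
```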
EDIT 2: To split out attribute name/value pairs, you can augment read_dom() like so:
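A sketch of the augmented reader: everything in ENTITY up to the first space becomes TAG_NAME and the rest becomes ATTRIBUTES (read's exit status is carried through here so the surrounding while loop still terminates; EDIT 3 below covers exactly that point):

```bash
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?               # preserve read's status (see EDIT 3 below)
    TAG_NAME=${ENTITY%% *}     # tag name: everything up to the first space
    ATTRIBUTES=${ENTITY#* }    # attributes: everything after it
    return $ret                # so the while loop still stops at EOF
}
```

Then write your function to parse and get the data you want like this (foo, bar, size and type are hypothetical element and attribute names, purely for illustration):

```bash
parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES   # turns size="..." into a local variable
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}
```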
Then while you read_dom, call parse_dom:
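```bash
while read_dom; do
    parse_dom
done < example.xml
```

Then given the following example markup (hypothetical, matching the parse_dom sketch above) in example.xml:

```xml
<example>
  <bar size="bar_size" type="metal">bar content</bar>
  <foo size="1789">foo content</foo>
</example>
```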
You should get this output:
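With the hypothetical markup above:

```
bar type is: metal
foo size is: 1789
```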
EDIT 3: Another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:
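A sketch of the suggested version:

```bash
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?               # save read's exit status...
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET                # ...and return it, not the last assignment's
}
```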
I don't see any reason why that shouldn't work.
I am not aware of any pure-shell XML parsing tool, so you will most likely need a tool written in another language.
My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as:

xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt

(the -t option gives you the result as text instead of XML).

Another command-line tool is my new Xidel. It also supports XPath 2 and XQuery, unlike the already-mentioned xpath/xmlstarlet.
The title can be read like:
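Something like this should do it (xidel's -e option evaluates an extraction expression against the input):

```bash
xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt
```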
And it also has a cool feature to export multiple variables to bash. For example:
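A sketch of that feature (title and imgcount are the variable names being bound; xidel's --output-format bash emits shell assignments suitable for eval):

```bash
eval "$(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash)"
```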
sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.

After some research on translating between Linux and Windows formats of file paths in XML files, I found interesting tutorials and solutions on:
Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, with just 2 little functions (more than 2, but you can mix them all). I don't say chad's one didn't work at all, but it had too many issues with badly formatted XML files: so you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.

The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools using perl, python or anything else. As for me, I cannot install cpan or perl modules on the old production OS I'm working on, and python isn't available.
First, a definition of the XML words used in this post:
EDIT: updated functions, with handling of:
The functions: first is xml_read_dom, which is called recursively by xml_read:
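The full function is long; as a minimal, hypothetical illustration of its core only (the same IFS trick as chad's read_dom, variable names assumed):

```bash
# Hypothetical minimal core; the real xml_read_dom additionally handles
# comments, attributes and misplaced spaces/CR/TAB as described above.
xml_read_dom() {
    local IFS=\>
    read -d \< XML_ENTITY XML_CONTENT
}
```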
and the second one:
and lastly, the rtrim, trim and echo2 (to stderr) functions:
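The originals are not reproduced here; standard bash implementations of those three helpers would look like:

```bash
# rtrim: strip trailing whitespace
rtrim() {
    local var="$*"
    printf '%s' "${var%"${var##*[![:space:]]}"}"
}

# trim: strip leading and trailing whitespace
trim() {
    local var="$*"
    var="${var#"${var%%[![:space:]]*}"}"
    printf '%s' "${var%"${var##*[![:space:]]}"}"
}

# echo2: echo to stderr
echo2() { echo "$@" >&2; }
```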
Colorization:
Oh, and you will need some neat colorizing dynamic variables defined at first, and exported, too:
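The author's exact palette isn't shown; a hypothetical set of ANSI color variables in that spirit:

```bash
# Hypothetical ANSI color variables (names illustrative)
export Red=$'\e[0;31m'
export Green=$'\e[0;32m'
export Yellow=$'\e[0;33m'
export Blue=$'\e[0;34m'
export End=$'\e[0m'    # reset
```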
How to load all that stuff:
Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash).
If not, just copy/paste everything on the command line.
How does it work:
With debug mode (-d), comments and parsed attributes are printed to stderr.
Introduction
Thank you very much for the earlier answers. The question headline is very ambiguous: the asker asks how to parse xml when what they actually want to parse is xhtml. Talk about ambiguity! Though they are similar, they are definitely not the same, and since xml and xhtml aren't the same, it was very hard to come up with a solution that's exactly what was asked for. However, I hope the solution below will still do. I want to admit I couldn't find out how to look specifically for /html/head/title. Now that that's been written, I want to say that I'm not satisfied with the earlier answers, since some of the answerers are reinventing the wheel unnecessarily when the asker didn't say that it's forbidden to download a package. I don't understand the unnecessary coding at all. I specifically want to repeat what a person in this thread already said: "Just because you can write your own parser, doesn't mean you should" - @Stephen Niedzielski. Regarding programming: the easiest and shortest way is as a rule to be preferred; never make anything more complex than needed. The solution has been tested with good results on Windows 10 > Windows Subsystem for Linux > Ubuntu. It's possible that if another title element existed and were selected, the result would be bad; sorry for that possibility. Example: if the <body> tags came before the <head> tags and the <body> tags contained a <title> tag. But that's very, very unlikely.

TLDR/Solution
On the general path to a solution, thank you @Grisha, @Nat, How to parse XML in Bash?
On removing xml tags, thank you @Johnsyweb, How to remove XML tags from Unix command line?
1. Install the "package" xmlstarlet
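On the Ubuntu/WSL setup mentioned above, that would be something like:

```bash
sudo apt-get install xmlstarlet
```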
2. Execute in bash:
xmlstarlet sel -t -m "//_:title" -c . -n xhtmlfile.xhtml | head -1 | sed -e 's/<[^>]*>//g' > titleOfXHTMLPage.txt