I am parsing an XML file using xmllint. Each <item> contains a description element with CDATA text inside, from which I would like to extract the title (the text up to the first <br />) and the URL for a specific domain (desiredURL.com). I am not proficient with regular expressions or with awk and sed. Is there a way to parse the data in the description element using xmllint again, or what would be an appropriate approach? I want to iterate over all the <item> elements and print the title and the URL for the domain desiredURL.com.
#!/bin/bash
ITEMS=$(echo "cat //item/description/text()" | xmllint --shell file.xml | grep -E '^\w')
# iterate over items and print title and desiredURL
file.xml:
<item>
<description><![CDATA[A title for the URLs<br /><br />
http://www.foobar.com/foo/bar
<br />http://bar.com/foo
<br />http://myurl.com/foo
<br />http://desiredURL.com/files/ddd
<br />http://asdasd.com/onefile/g.html
<br />http://second.com/link
<br />]]></description>
</item>
<item>
<description> ...</description>
</item>
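
To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes an xmllint build that supports the --xpath option, and that the real file.xml is well-formed (the fragment above would need a single root element around the <item> elements). The function names parse_description and print_items are hypothetical, chosen only for illustration. The idea is to let xmllint return the string value of each description, then split that text with plain shell tools.

```shell
#!/bin/bash

# Hypothetical helper: reads the text of one description on stdin and
# prints "<title><TAB><url>", where the title is everything before the
# first "<br" and the URL is the first one on desiredURL.com.
parse_description() {
  local text title url
  text=$(cat)
  # Strip from the first "<br" through the end of the text.
  title=${text%%"<br"*}
  # -o prints only the matching part, -i ignores case in the host name.
  url=$(printf '%s\n' "$text" |
    grep -Eio 'https?://(www\.)?desiredURL\.com[^<[:space:]]*' |
    head -n 1)
  printf '%s\t%s\n' "$title" "$url"
}

# Hypothetical driver: asks xmllint how many <item> elements exist, then
# feeds the string value of each description to parse_description.
print_items() {
  local file=$1 count i
  count=$(xmllint --xpath 'count(//item)' "$file")
  i=1
  while [ "$i" -le "$count" ]; do
    xmllint --xpath "string((//item)[$i]/description)" "$file" |
      parse_description
    i=$((i + 1))
  done
}
```

Calling print_items file.xml would then print one tab-separated line per <item>. An item with no desiredURL.com link gets an empty URL field. The URL pattern is deliberately loose (it stops at whitespace or "<"), so it would need tightening if the descriptions contain look-alike host names.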