Get valid URL from string using Bash script

Published 2019-08-29 10:19

Question:

I am parsing an XML file using xmllint. There is a description element in each <item> containing CDATA text, from which I would like to extract the title (the text up to the first <br />) and the URL of a specific domain (desiredURL.com). I am not a pro in regular expressions or in the use of awk and sed. Is there a way to parse the data in the description element using xmllint again, or what would be an appropriate approach? I want to iterate over all the <item> elements and print the title and the URL of the domain desiredURL.com.

#!/bin/bash
# Grab the text nodes of every <item>/<description> (lines starting with a word character).
ITEMS=$(echo "cat //item/description/text()" | xmllint --shell file.xml | egrep '^\w')
# iterate over items and print title and desiredURL


file.xml:

<item>
    <description><![CDATA[A title for the URLs<br /><br />

    http://www.foobar.com/foo/bar
    <br />http://bar.com/foo
    <br />http://myurl.com/foo
    <br />http://desiredURL.com/files/ddd
    <br />http://asdasd.com/onefile/g.html
    <br />http://second.com/link
    <br />]]></description> 



    </item>
<description> ...</description>
    <item>
</item>

Answer 1:

xmllint

There is an --xpath option you can use to pass an XPath.
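For example, assuming the input file is named file.xml as in the question, the raw CDATA text of every description can be dumped with a command along these lines:

# Print the text nodes (the CDATA payload) of every <item>/<description>.
# Assumes the input file is file.xml, as in the question.
xmllint --xpath '//item/description/text()' file.xml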

Extracting URL

Assuming nothing follows the URLs on each line, you can use grep with:

  • -P flag: Perl-compatible regular expressions (PCRE);
  • -o flag: only print the matched (non-empty) parts.

Command

xmllint --xpath '//item/description' /tmp/so.xml | grep -Po 'http:.*' 
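
To also iterate over every <item> and pair its title with the desiredURL.com link, as the question asks, a minimal sketch could look like the following. It assumes the items live in a well-formed document named file.xml, that the title is everything before the first <br />, and that at most one desiredURL.com link appears per item:

#!/bin/bash
# Sketch: for each <item>, print the title (text before the first <br />)
# and the link that points at desiredURL.com.

count=$(xmllint --xpath 'count(//item/description)' file.xml)

for ((i = 1; i <= count; i++)); do
    # String value of the i-th description, i.e. its CDATA content.
    desc=$(xmllint --xpath "string(//item[$i]/description)" file.xml)

    # Title: first line, with everything from the first <br /> stripped.
    title=$(printf '%s\n' "$desc" | head -n1 | sed 's/<br \/>.*//')

    # URL: keep only the link for the desired domain (if any).
    url=$(printf '%s\n' "$desc" | grep -Po 'http://desiredURL\.com\S*')

    echo "$title -> $url"
done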


Tags: xml bash shell