Parsing HTML and extracting data using shell [d

Posted 2019-02-20 13:34

This question already has an answer here:

I need to parse an HTML file and extract four parts of it using a shell script. However, I am quite new to shell scripting. So far I have only started with a for loop over cat $1 to go through each line of the HTML. Can anybody help me or give me some advice?

4 answers
够拽才男人
Reply #2 · 2019-02-20 14:14

HTML-XML-utils

You can use HTML-XML-utils to parse well-formed HTML/XML files. The package includes many command-line tools for extracting or modifying data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>

For more examples, check the HTML-XML-utils documentation.
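
To get only the content of an element rather than the surrounding tags, hxselect can be combined with hxnormalize from the same package. The following is a minimal sketch, not a definitive recipe: the sample page, the helper name extract, and the selectors are assumptions made for illustration. It assumes hxnormalize -x is used to turn the input into well-formed XML before hxselect reads it, and that hxselect -c prints only the content of each match, with -s '\n' adding a newline after each one.

```shell
#!/bin/sh
# Hypothetical sample page, used here instead of a live download.
sample='<html>
<head><title>Sample Page</title></head>
<body>
<h1>Report</h1>
<p class="intro">Short summary.</p>
</body>
</html>'

# extract is a hypothetical helper: normalize the sample to XML,
# then print only the content of the elements matching the selector in $1.
extract() {
    printf '%s\n' "$sample" | hxnormalize -x | hxselect -c -s '\n' "$1"
}

if command -v hxselect >/dev/null 2>&1 && command -v hxnormalize >/dev/null 2>&1; then
    extract 'title'     # content of the <title> element
    extract 'h1'        # content of the heading
    extract 'p.intro'   # content of the paragraph with class "intro"
else
    echo 'html-xml-utils is not installed' >&2
fi
```

In a real script the printf line would be replaced by cat "$1" or a curl call, so the same helper could be applied to the file passed as the first argument.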

乱世女痞
Reply #3 · 2019-02-20 14:18

ex/vim

For more advanced parsing, you may use in-place editors such as ex/vi where you can jump between matching html tags or edit the content in-place.

Example that removes the style tag from the header and prints the parsed output:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, parsing HTML with regular expressions is not advised, so for a long-term approach you should use a language with a proper HTML/DOM parser (such as Python, Perl, or PHP DOM).

Explosion°爆炸
Reply #4 · 2019-02-20 14:22

pup/xpup

Use the pup utility to parse HTML from the command line using CSS selectors, or xpup to parse HTML/XML using XPath.

For example:

$ curl -sL http://example.com/ | pup title
<title>
 Example Domain
</title>

or:

$ pup -f <(curl -sL http://example.com/) div p a attr{href}
http://www.iana.org/domains/example

or:

$ xpup -f <(curl -sL http://example.com/) "html/body/div/h1"
Example Domain
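
Since the question asks for four parts of a page, the same idea can be sketched as one pup call per part. This is only an illustration under stated assumptions: the sample page, the helper name part, and the four selectors are hypothetical, and the selectors are quoted so the shell passes the braces to pup unchanged. It relies on pup's text{} display function in addition to the attr{href} function shown above.

```shell
#!/bin/sh
# Hypothetical sample page standing in for the asker's HTML file.
sample='<html>
<head><title>Sample Page</title></head>
<body>
<h1>Report</h1>
<p class="intro">Short summary.</p>
<a href="http://example.com/data.csv">Download</a>
</body>
</html>'

# part is a hypothetical helper: it runs one pup selector over the sample.
part() {
    printf '%s\n' "$sample" | pup "$1"
}

if command -v pup >/dev/null 2>&1; then
    part 'title text{}'     # text of the <title> element
    part 'h1 text{}'        # text of the first-level heading
    part 'p.intro text{}'   # text of the paragraph with class "intro"
    part 'a attr{href}'     # value of the link's href attribute
else
    echo 'pup is not installed' >&2
fi
```

Each result could be captured with command substitution, for example title=$(part 'title text{}'), if the values are needed later in the script.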
Explosion°爆炸
Reply #5 · 2019-02-20 14:28

BSD/GNU grep/ripgrep

For simple extraction, you can use grep, for example:

  • Extracting the outer HTML of the h1 element:

    $ curl -s http://example.com/ | grep -o '<h1>.*</h1>'
    <h1>Example Domain</h1>
    
  • Extracting the body:

    $ curl -s http://example.com/ | xargs | grep -o '<body>.*</body>'
    <body> <div> <h1>Example Domain</h1> ...
    

    Instead of xargs you can also use tr '\n' ' ', which joins the lines without interpreting quotes or backslashes in the input.

  • For multiple tags, see: Text between two tags.
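
The grep patterns above use '.*', which is greedy and matches up to the last closing tag on the line. The sketch below, run against a hypothetical sample page rather than a live download, shows that effect and one possible workaround using '[^<]*', plus the tr approach for an element that spans several lines. The variable names are illustrative only.

```shell
#!/bin/sh
# Hypothetical sample page with two headings on one line
# and a paragraph that spans two lines.
sample='<html>
<head><title>Sample Page</title></head>
<body>
<h1>First</h1><h1>Second</h1>
<p>Some
text</p>
</body>
</html>'

# Greedy: '.*' runs to the last </h1> on the line,
# so both headings come back as a single match.
greedy=$(printf '%s\n' "$sample" | grep -o '<h1>.*</h1>')

# '[^<]*' stops at the next tag, giving one match per heading.
per_tag=$(printf '%s\n' "$sample" | grep -o '<h1>[^<]*</h1>')

# Join all lines with tr so the multi-line paragraph fits on one line,
# then strip the tags with sed to keep only the text.
para=$(printf '%s\n' "$sample" | tr '\n' ' ' | grep -o '<p>[^<]*</p>' | sed 's/<[^>]*>//g')

printf '%s\n' "$greedy" "$per_tag" "$para"
```

The '[^<]*' pattern only works for elements without nested tags; anything more complex is better handled by one of the parser-based tools from the other answers.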

If you are dealing with large datasets, consider ripgrep (rg), which has a similar syntax and is typically faster than grep on large inputs.
