Parsing HTML and extracting data using a shell script [duplicate]

This question already has an answer here:

Bash parse HTML (4 answers)

I need to parse an HTML file and extract four parts of it using a shell script. However, I am quite new to shell scripting. I have only started with a for loop over `cat $1` that goes through each line of the HTML. Can anybody help me or give me some advice?

Tags: html parsing shell

4 answers

够拽才男人

#2 · 2019-02-20 14:14

`HTML-XML-utils`

You can use html-xml-utils to parse well-formed HTML/XML files. The package includes many command-line tools for extracting or modifying data. For example:

$ curl -s http://example.com/ | hxselect title
<title>Example Domain</title>

For more examples, see the html-xml-utils documentation.
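Real-world pages are often not well-formed, so it can help to normalize them first. The following is a minimal sketch, assuming html-xml-utils is installed; the function name `select_text` and the sample markup are made up for illustration. It uses `hxnormalize -x` to rewrite the input as well-formed XML and `hxselect -c` to print only the content of the matching elements:

```shell
#!/bin/sh
# Hypothetical helper (not part of html-xml-utils): print the content of
# every element on stdin that matches the CSS selector given as $1.
select_text() {
    # hxnormalize -x turns the HTML into well-formed XML;
    # hxselect -c drops the enclosing tags, -s '\n' ends each match with a newline.
    hxnormalize -x | hxselect -c -s '\n' "$1"
}

# Demo on inline sample markup; skipped when the tools are missing.
if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
    printf '%s\n' '<html><head><title>Example Domain</title></head><body><p>Hi</p></body></html>' |
        select_text title
else
    echo 'html-xml-utils is not installed' >&2
fi
```

With a live page the same function could be used as `curl -s http://example.com/ | select_text title`.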


乱世女痞

#3 · 2019-02-20 14:18

`ex`/`vim`

For more advanced parsing, you can use editors such as ex/vi, which let you jump between matching HTML tags and edit the content in place.

Example that removes the style tag from the header and prints the result:

$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin

However, parsing HTML with regular expressions is not advisable, so as a long-term approach you should use a language with a proper HTML parser (for example Python, Perl, or PHP's DOM extension).

`pup`/`xpup`

Use the pup utility to parse HTML from the command line with CSS selectors, or xpup to parse HTML/XML with XPath.

For example:

$ curl -sL http://example.com/ | pup title
<title>
 Example Domain
</title>

or:

$ pup -f <(curl -sL http://example.com/) div p a attr{href}
http://www.iana.org/domains/example

or:

$ xpup -f <(curl -sL http://example.com/) "html/body/div/h1"
Example Domain


Explosion°爆炸

#5 · 2019-02-20 14:28

BSD/GNU `grep`/`ripgrep`

For simple extraction, you can use grep. For example:

Extracting the outer HTML of the H1 element:

$ curl -s http://example.com/ | grep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
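If only the text inside the tag is wanted, the match can be piped through sed to strip the tags. This is a minimal sketch, assuming the whole element sits on one line; the function name `h1_text` and the sample markup are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: print the inner text of each single-line <h1> element on stdin.
h1_text() {
    # grep -o keeps only the matched element; sed then deletes every tag in it.
    grep -o '<h1>.*</h1>' | sed -e 's/<[^>]*>//g'
}

# Sample input standing in for the curl output.
printf '%s\n' '<body>' '<div>' '    <h1>Example Domain</h1>' '</div>' '</body>' | h1_text
```

Because `.*` is greedy, two `<h1>` elements on the same line would be merged into one match, which is one reason this only suits simple pages.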

Extracting the body:

$ curl -s http://example.com/ | xargs | grep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...

Instead of xargs, you can also use tr '\n' ' '.
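The tr variant might look like the sketch below. The function name `body_html` and the sample markup are assumptions for illustration. Unlike xargs, tr does not interpret quotes or backslashes, so an apostrophe in the page text should not trip it up:

```shell
#!/bin/sh
# Hypothetical helper: join stdin into a single line, then print the <body> element.
body_html() {
    # tr replaces every newline with a space so the regex can span the whole document.
    tr '\n' ' ' | grep -o '<body>.*</body>'
}

# Sample input standing in for the curl output.
printf '%s\n' '<html>' '<body>' '<h1>Example Domain</h1>' '</body>' '</html>' | body_html
```

The pattern only matches a bare `<body>` tag; a page with `<body class="...">` would need a looser pattern such as `<body[^>]*>.*</body>`.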

For multiple tags, see: Text between two tags.

If you're dealing with large datasets, consider using ripgrep, which has a similar syntax and is often considerably faster (it is written in Rust).
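To illustrate how similar the syntax is, here is a small sketch that uses ripgrep when it is installed and falls back to grep otherwise. The wrapper name `match_only` is hypothetical, and the trailing `-` tells rg to read standard input:

```shell
#!/bin/sh
# Hypothetical wrapper: print only the parts of stdin that match the regex in $1.
match_only() {
    if command -v rg >/dev/null 2>&1; then
        # ripgrep: -o prints only the matching text, "-" searches stdin.
        rg -o -e "$1" -
    else
        # Fallback for systems without ripgrep.
        grep -o -e "$1"
    fi
}

# Sample input standing in for the curl output.
printf '%s\n' '<p>intro</p>' '<h1>Example Domain</h1>' | match_only '<h1>.*</h1>'
```

The two tools use different regex engines, so more complex patterns may not behave identically in both branches.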
