This question already has answers here: Bash parse HTML (4 answers)
I need to parse an HTML file and extract 4 parts of it using a shell script. However, I am quite new to shell scripting. I just started with a for loop over cat $1 to look through each line of the HTML. Can anybody help me or give me advice?
HTML-XML-utils
You may use htmlutils for parsing well-formed HTML/XML files. The package includes many command-line tools to extract or modify the data. For more examples, check the html-xml-utils documentation.
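As a rough illustration of the package, the sketch below pulls the title out of a small document with hxnormalize and hxselect. The sample HTML and the variable names are made up for this example, and the block simply skips the extraction if html-xml-utils is not installed.

```shell
#!/bin/sh
# Sample document (illustrative only, not from the original post).
html='<html><head><title>Example Domain</title></head>
<body><h1>Example Domain</h1><p>Some text.</p></body></html>'

if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
  # hxnormalize -x rewrites the input as well-formed XML, which hxselect expects.
  # hxselect -c prints only the content of the element matched by the CSS selector.
  title=$(printf '%s\n' "$html" | hxnormalize -x | hxselect -c 'title')
  printf '%s\n' "$title"
else
  echo "html-xml-utils is not installed, skipping" >&2
fi
```

Normalizing first matters because hxselect reads XML, so real-world HTML with unclosed tags should go through hxnormalize -x before selection.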
ex/vim
For more advanced parsing, you may use in-place editors such as ex/vi, where you can jump between matching HTML tags or edit the content in place. For example, you can remove the style tag from the header and print the parsed output.
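A minimal sketch of that idea, assuming a Vim-compatible ex and a style block whose opening and closing tags sit on separate lines. It uses a plain line-range delete rather than tag-matching motions, and the sample file is invented for the example.

```shell
#!/bin/sh
# Write a small sample page to a temporary file (illustrative only).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html>
<head>
<style>
body { color: red; }
</style>
<title>Example Domain</title>
</head>
<body><h1>Example Domain</h1></body>
</html>
EOF

if command -v ex >/dev/null 2>&1; then
  # Delete from the line matching <style through the next line matching </style>,
  # print the whole buffer, then quit without saving the file.
  cleaned=$(ex -s -c '/<style/;/<\/style>/d|%p|q!' "$tmp" </dev/null)
  printf '%s\n' "$cleaned"
else
  echo "ex is not installed, skipping" >&2
fi

rm -f "$tmp"
```

Because the editor quits with q!, the file on disk is left untouched and only the printed buffer reflects the removal.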
However, parsing HTML with regular expressions is not advised, so for a long-term approach you should use an appropriate language with a proper parser (such as Python, Perl or PHP DOM).
pup/xpup
Use the pup utility to parse HTML from the command line using CSS selectors, or xpup to parse HTML/XML using XPath.
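A short sketch of the CSS-selector approach with pup, assuming pup is on the PATH. The sample HTML and variable names are invented for the example, and xpup is not shown here.

```shell
#!/bin/sh
# Sample document (illustrative only).
html='<html><head><title>Example Domain</title></head>
<body><h1>Example Domain</h1>
<p><a href="https://example.org/more">More information</a></p></body></html>'

if command -v pup >/dev/null 2>&1; then
  # text{} prints the text content of the matched elements.
  title=$(printf '%s\n' "$html" | pup 'title text{}')
  # attr{href} prints the value of the href attribute of each matched link.
  link=$(printf '%s\n' "$html" | pup 'a attr{href}')
  printf 'title: %s\nlink: %s\n' "$title" "$link"
else
  echo "pup is not installed, skipping" >&2
fi
```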
BSD/GNU grep/ripgrep
For simple extraction, you can use grep, for example to extract the outer HTML of an H1 or the whole body. Since grep matches line by line, extracting the body requires joining the lines first, e.g. with xargs. Instead of xargs you can also use tr '\n' ' '. For multiple tags, see: Text between two tags.
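The H1 and body extractions described above might look like the following sketch. The sample HTML and variable names are assumptions made for the example; the patterns only work for tags without attributes.

```shell
#!/bin/sh
# Sample document (illustrative only, deliberately free of quote characters).
html='<html>
<head><title>Example Domain</title></head>
<body>
<h1>Example Domain</h1>
<p>Some text.</p>
</body>
</html>'

# Outer HTML of the H1: -o prints only the matching part of the line.
h1=$(printf '%s\n' "$html" | grep -o '<h1>.*</h1>')

# Body: join all lines into one first, because grep matches per line.
body=$(printf '%s\n' "$html" | xargs | grep -o '<body>.*</body>')

# Same extraction with tr instead of xargs.
body_tr=$(printf '%s\n' "$html" | tr '\n' ' ' | grep -o '<body>.*</body>')

printf '%s\n%s\n' "$h1" "$body"
```

Note that xargs treats quotes and backslashes specially, so for HTML containing quoted attributes the tr variant is the safer way to join lines.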
If you're dealing with large datasets, consider using ripgrep, which has similar syntax and is usually considerably faster (it is written in Rust).
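For comparison, a minimal ripgrep sketch of the same H1 extraction, assuming rg is installed. The sample file is invented for the example.

```shell
#!/bin/sh
# Write a small sample page to a temporary file (illustrative only).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html>
<body>
<h1>Example Domain</h1>
</body>
</html>
EOF

if command -v rg >/dev/null 2>&1; then
  # -o prints only the matched text; -N and -I suppress line numbers and file names.
  h1=$(rg -o -N -I '<h1>.*</h1>' "$tmp")
  printf '%s\n' "$h1"
else
  echo "ripgrep is not installed, skipping" >&2
fi

rm -f "$tmp"
```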