I am trying to create a BASH script that would extract data from an HTML table. Below is an example of the table I need to extract data from:
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.143 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.888 s</td></tr>
</table>
And I want the BASH script to output it like so:
SAVE_DOCUMENT OK 0.475 s
GET_DOCUMENT OK 0.345 s
DVK_SEND OK 0.002 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 4.465 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.002 s
SUMMARY_STATUS OK 5.294 s
How to do it?
So far I have tried using sed, but I don't know how to use it very well. I excluded the header of the table (Component, Status, Time/Error) with `grep "<tr><td>"`, so that only lines starting with `<tr><td>` are selected for the next parsing step (sed).
This is what I used: sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'
But then the `<tr>` tags still remain, and it also won't separate the strings. In other words, the result of this script is:

    <tr>SAVE_DOCUMENTOK0.406 s</tr>
The full command of the script I'm working on is:

    cat $FILENAME | grep "<tr><td>" | sed 's@<\([^<>][^<>]*\)>\([^<>]*\)</\1>@\2@g'
You can use bash and `xpath` (the CLI of the XML::XPath Perl module) to accomplish that task very easily.
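For example (a sketch; the flag syntax of the `xpath` utility differs between versions, and `paste` is only one way to re-join the cells into rows):

```shell
# A sketch, assuming the `xpath` CLI that ships with Perl's XML::XPath
# distribution.  Note: XML::XPath needs well-formed XML, so the sample's
# bare attribute (border=1) would have to be quoted first (border="1").
xpath -q -e '//tr[position()>1]/td/text()' file.html |
  paste -d' ' - - -   # re-join every three cells into one line
```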
Go with `(g)awk`, it's capable :-). Here is a solution, but please note: it only works with the exact HTML table format you posted.

You can see it in action here: https://ideone.com/zGfLe
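A sketch reconstructed from the explanation that follows (the field numbers assume the exact one-row-per-line layout above):

```shell
# -F'</*t[rd]>' splits each line on any <tr>, </tr>, <td> or </td> tag,
# so $3 = component, $5 = status, $7 = time; the pattern keeps only rows
# whose first two data fields are uppercase.
awk -F'</*t[rd]>' '$3 ~ /^[A-Z_]+$/ && $5 ~ /^[A-Z]+$/ { print $3, $5, $7 }' "$FILENAME"
```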
Some explanation:

- `-F` sets the input field separator to a regexp (any of `tr`'s or `td`'s opening or closing tags),
- then it works only on lines that match those tags AND contain at least two uppercase fields,
- then it prints the needed fields.
HTH
A solution based on the multi-platform web-scraping CLI `xidel` and XQuery. With the sample input, this yields one space-separated line per data row, such as `SAVE_DOCUMENT OK 0.406 s`.
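A sketch of the invocation, with the XQuery quoted verbatim from the explanation that follows:

```shell
# -s silences xidel's status output; -e evaluates the query against the input.
xidel -s file.html -e 'for $tr in //tr[position()>1] return join($tr/td, " ")'
```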
Explanation:

The XQuery query

    for $tr in //tr[position()>1] return join($tr/td, " ")

processes the `tr` elements starting with the 2nd one (`position()>1`, to skip the header row) in a loop, and joins the values of the child `td` elements (`$tr/td`) with a single space as the separator.

`-s` makes `xidel` silent (suppresses output of status information).
While `html2text` is convenient for display of the extracted data, providing machine-parseable output is non-trivial, unfortunately:
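One way to do it (a sketch; the `|` field separator and the two header lines to skip are assumptions about this particular `html2text` version's table rendering):

```shell
html2text file.html |
  awk -F' *\| *' -v OFS=' ' '
    { gsub("\b.", "") }         # drop backspace-based overstrike sequences
    NR > 2 { $1 = $1; print }   # skip header lines; rebuild fields with OFS
  '
```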
The Awk command removes the hidden `\b`-based (backspace-based) sequences that `html2text` outputs by default, parses the lines into fields by `|`, and then outputs them with a space as the separator (a space is Awk's default output field separator; to change it to a tab, for instance, use `-v OFS='\t'`).
Note: Use of `-nobs` to suppress backspace sequences at the source is not an option, because you then won't be able to distinguish between the hidden-by-default `_` instances used for padding and actual `_` characters in the data.
Note: Given that `html2text` seemingly invariably uses `|` as the column separator, the above will only work robustly if there are no `|` instances in the data being extracted.
You may use the `html2text` command and format the columns via `column`, e.g.:
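A sketch of such a pipeline:

```shell
# html2text renders the table as text; column(1) then aligns the fields
# (`-s '|'` sets the input separator, `-t` tabulates).
html2text file.html | column -t -s '|'
```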
then parse it further from there (e.g. `cut`, `awk`, `ex`).

In case you'd like to sort it first, you can use `ex`; see the example here or here.

You can parse the file using the Ex editor (part of Vim) by removing HTML tags, e.g.:
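A sketch; the three Ex commands are the ones itemized in the explanation below:

```shell
# -s runs Ex (part of Vim) silently; each +'…' command runs in order:
# strip tags, keep only lines containing a 0 (the timing rows),
# then write the buffer to stdout and quit.
ex -s +'%s/<[^>]\+>/ /g' +'v/0/d' +'wq! /dev/stdout' file.html
```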
Here is a shorter version that prints the whole file without HTML tags:
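Presumably along these lines (a sketch; `%p` prints the whole buffer):

```shell
# Substitute every HTML tag with a space, print the whole buffer, quit.
ex -s +'%s/<[^>]\+>/ /g' +'%p' +'q!' file.html
```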
Explanation:

- `%s/<[^>]\+>/ /g` - substitute all HTML tags with a space.
- `v/0/d` - delete all lines without a `0`.
- `wq! /dev/stdout` - quit the editor and write the buffer to the standard output.

There are a lot of ways of doing this, but here's one:
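The pipeline might have looked along these lines (a sketch consistent with the description that follows):

```shell
# sed turns every tag into a space and squeezes the runs of spaces,
# then cut -c2- drops the leading space that is left over.
grep '^<tr><td>' "$FILENAME" \
  | sed -e 's:<[^>]*>: :g' \
        -e 's:  *: :g' \
        -e 's: *$::' \
  | cut -c2-
```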
You could use more sed(1) (`-e 's:^ ::'`) instead of the `cut -c2-` to remove the leading space, but cut(1) doesn't get as much love as it deserves. And the backslashes are just there for formatting; you can remove them to get a one-liner, or leave them in and make sure that they're immediately followed by a newline.

The basic strategy is to slowly pull the HTML apart piece by piece rather than trying to do it all at once with a single incomprehensible pile of regex syntax.
Parsing HTML with a shell pipeline isn't the best idea ever, but you can do it if the HTML is known to come in a very specific format. If there will be variation, then you'd be better off with a real HTML parser in Perl, Ruby, Python, or even C.