I have a reporting tool that writes job scheduling statistics to an HTML file, and I want to consume this data with Perl. I don't know how to step through an HTML table, though.
I know how to do this with jQuery using:
$('tr').each(function () {
    var text = $(this).find('td').text();
});
But I don't know how to express the same logic in Perl. What should I do? Below is a sample of the HTML output. Each table row includes the same three stats: object name, status, and return code.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
<HEAD>
<meta name="GENERATOR" content="UC4 Reporting Tool V8.00A">
<Title></Title>
<style type="text/css">
th,td {
font-family: arial;
font-size: 0.8em;
}
th {
background: rgb(77,148,255);
color: white;
}
td {
border: 1px solid rgb(208,213,217);
}
table {
border: 1px solid grey;
background: white;
}
body {
background: rgb(208,213,217);
}
</style>
</HEAD>
<BODY>
<table>
<tr>
<th>Object name</th>
<th>Status</th>
<th>Return code</th>
</tr>
<tr>
<td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
<tr>
<td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
<tr>
<td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
</table>
</BODY>
</HTML>
Use the CPAN module HTML::TreeBuilder.
I use it extensively to parse a lot of HTML documents.
The concept is that you get an HTML::Element (the root node, for example). From it, you can look for other nodes:
Disclaimer: The following code has not been tested, but it shows the idea.
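As a minimal sketch of that idea (not necessarily the snippet this answer was written around), assuming the report has already been read into a string. The `$html` variable and its single data row are illustrative, and `look_down` and `as_trimmed_text` are HTML::Element methods:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Illustrative input in the shape of the report above.
my $html = <<'HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
for my $tr ($tree->look_down(_tag => 'tr')) {
    # The header row holds only <th> cells, so it yields nothing and is skipped.
    my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
    next unless @cells;
    push @rows, \@cells;
}
$tree->delete;    # release the parse tree once the text has been copied out

for my $row (@rows) {
    my ($name, $status, $rc) = @$row;
    print "$name | $status | $rc\n";
}
```

Each element of `@rows` is an array reference holding the three cell texts of one data row, in document order.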
If your table is more complex than that, you could first find the TABLE element, then iterate over its TR children, and for each TR iterate over its TD elements.
http://metacpan.org/pod/HTML::TreeBuilder
The HTML::Query module is a wrapper around the HTML parser that provides a querying interface that is familiar to jQuery users. So you could write something like
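A rough sketch of such a query, under a few assumptions: the HTML is held in a string (`$html` is illustrative), the `Query(text => ...)` constructor and `get_elements` behave as described in the HTML::Query documentation, and the per-row cell lookup falls back to plain HTML::Element methods rather than a second query:

```perl
use strict;
use warnings;
use HTML::Query 'Query';

# Illustrative input in the shape of the report above.
my $html = <<'HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
HTML

my $q = Query(text => $html);

my @query_rows;
for my $tr ($q->query('tr')->get_elements) {
    # Each result is an HTML::Element, so the usual element methods apply.
    my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
    push @query_rows, \@cells if @cells;    # skip the <th>-only header row
}

print join(' | ', @$_), "\n" for @query_rows;
```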
Read the HTML::Query documentation to get a better idea of how to use it; the above is hardly the prettiest example.
You could use a regular expression, but Perl already has modules built for this specific task. Check out HTML::TableContentParser.
You would probably do something like this:
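A sketch along those lines, assuming the data structure shown in the HTML::TableContentParser synopsis (a list of tables, each with `rows`, each row with `cells`, each cell with its text under `data`). The `$html` string is illustrative, and the `|| []` guards are a precaution for rows that carry no cells, such as the header row:

```perl
use strict;
use warnings;
use HTML::TableContentParser;

# Illustrative input in the shape of the report above.
my $html = <<'HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
HTML

my $parser = HTML::TableContentParser->new;
my $tables = $parser->parse($html);

my @tcp_rows;
for my $table (@$tables) {
    for my $row (@{ $table->{rows} || [] }) {
        my @cells = map { $_->{data} } @{ $row->{cells} || [] };
        push @tcp_rows, \@cells if @cells;
    }
}

print join(' | ', @$_), "\n" for @tcp_rows;
```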
Here I use HTML::Parser. It is a little verbose, but reliable. I am using the diamond operator, so you can use it as a filter. If you save this Perl source as extractTd, you can either pass the HTML file as an argument or feed it in on standard input. Both will work; output goes to standard output, and you can redirect it to a file.
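A minimal sketch of such an event-driven extractor, using the version 3 handler API of HTML::Parser. The `extract_td` function name and the sample string are illustrative; to keep the example self-contained it reads from an in-memory filehandle, whereas a real filter would loop with `while (my $line = <>)` instead:

```perl
use strict;
use warnings;
use HTML::Parser;

# Collect the text of every <td>, grouped by <tr>, from an open filehandle.
sub extract_td {
    my ($fh) = @_;
    my (@rows, $row, $in_td);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'tr') { $row = []; $in_td = 0; }
            if ($tag eq 'td' && $row) { push @$row, ''; $in_td = 1; }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $row->[-1] .= $text if $in_td;    # text may arrive in several chunks
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_td = 0 if $tag eq 'td';
            if ($tag eq 'tr' && $row) {
                push @rows, $row if @$row;    # header rows have no <td>
                undef $row;
                $in_td = 0;
            }
        }, 'tagname' ],
    );

    # Feed the parser line by line, as a filter would.
    while (my $line = <$fh>) {
        $parser->parse($line);
    }
    $parser->eof;
    return @rows;
}

my $sample = <<'HTML';
<table>
<tr>
<th>Object name</th><th>Status</th><th>Return code</th>
</tr>
<tr>
<td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td>
<td>ENDED_OK - ended normally</td>
<td>0</td>
</tr>
</table>
HTML

open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print join("\t", @$_), "\n" for extract_td($fh);
```

The handlers only track two pieces of state (the current row and whether the parser is inside a cell), which is what makes this approach verbose compared with the tree-based modules but also easy to run over a stream.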
Have you tried looking on CPAN for HTML libraries? This one seems to do what you want: http://search.cpan.org/~msisk/HTML-TableExtract-2.11/lib/HTML/TableExtract.pm
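For illustration, a short sketch of how HTML::TableExtract could be pointed at this report, selecting the table by its header text. The `$html` string and the hash keys `name`, `status`, and `rc` are made up for the example:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Illustrative input in the shape of the report above.
my $html = <<'HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
HTML

# Match the table by its column headers; columns come back in this order.
my $te = HTML::TableExtract->new(
    headers => [ 'Object name', 'Status', 'Return code' ],
);
$te->parse($html);

my @jobs;
for my $table ($te->tables) {
    # With headers given, rows() returns the data rows without the header row.
    for my $row ($table->rows) {
        my ($name, $status, $rc) = @$row;
        push @jobs, { name => $name, status => $status, rc => $rc };
    }
}

print "$_->{name}: $_->{status} (rc=$_->{rc})\n" for @jobs;
```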
Also, here is a whole page of other HTML-related libraries: http://search.cpan.org/search?m=all&q=html+&s=1&n=100