Get <td> values with Perl

I have a reporting tool that writes job scheduling statistics to an HTML file, and I want to consume this data with Perl. However, I don't know how to step through an HTML table.

I know how to do this with jQuery using

$('tr').each(function(){
  variable = $(this).find('td').text();
});

But I don't know how to apply the same logic in Perl. What should I do? Below is a sample of the HTML output. Each table row includes the same three stats: object name, status, and return code.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<HTML>
<HEAD>
<meta name="GENERATOR" content="UC4 Reporting Tool V8.00A">
<Title></Title>
<style type="text/css">
th,td {
font-family: arial;
font-size: 0.8em;
}

th {
background: rgb(77,148,255);
color: white;
}

td {
border: 1px solid rgb(208,213,217);
}  

table {
border: 1px solid grey; 
background: white;
}

body {
background: rgb(208,213,217);
}
</style>
</HEAD>
<BODY>
<table>
<tr>
  <th>Object name</th>
  <th>Status</th>
  <th>Return code</th>
</tr>
<tr>
  <td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
  <td>ENDED_OK - ended normally</td>
  <td>0</td>
</tr>
<tr>
  <td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td>
  <td>ENDED_OK - ended normally</td>
  <td>0</td>
</tr>
<tr>
  <td>JOBS.UNIX.S_SITEVIEW.WF_M_SITEVIEW_CHK_FACILITIES_REGISTRY</td>
  <td>ENDED_OK - ended normally</td>
  <td>0</td>
</tr>

Tags: perl parsing

5 answers

仙女界的扛把子

#2 · 2019-06-26 00:09

Use the CPAN module HTML::TreeBuilder.

I use it extensively to parse a lot of HTML documents.

The concept is that you start from an HTML::Element (the root node, for example). From it, you can look for other nodes:

Get a list of child nodes with ->content_list()
Get the parent node with ->parent()

Disclaimer: the following code has not been tested, but it shows the idea.

use HTML::TreeBuilder;

# $content holds the HTML document as a string
my $root = HTML::TreeBuilder->new;
$root->utf8_mode(1);
$root->parse($content);
$root->eof();
# This gets you an HTML::Element for the document root
$root->elementify();

# look_down returns every descendant element whose tag is "td"
my @td = $root->look_down("_tag", "td");
foreach my $td_elem (@td)
{
    printf "-> %s\n", $td_elem->as_trimmed_text();
}

If your table is more complex than that, you could first find the TABLE element, then iterate over its TR children, and for each TR child, iterate over its TD elements.
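As a rough sketch of that row-by-row approach (not run against the real report): the helper name table_rows and the abbreviated inline sample are made up for illustration, while new_from_content, look_down, and as_trimmed_text are HTML::TreeBuilder / HTML::Element methods. It assumes the first table in the document is the one you want and that it contains no nested tables.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns one array ref of trimmed cell texts
# per data row of the first table in the document.
sub table_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @rows;

    # In scalar context look_down returns the first matching element.
    my $table = $tree->look_down(_tag => 'table');
    if ($table) {
        for my $tr ($table->look_down(_tag => 'tr')) {
            my @cells = map { $_->as_trimmed_text } $tr->look_down(_tag => 'td');
            # The header row holds only <th> cells, so it yields no <td> and is skipped.
            push @rows, \@cells if @cells;
        }
    }

    $tree->delete;    # release the tree once the text has been copied out
    return @rows;
}

# Abbreviated sample in the shape of the report from the question.
my $html = <<'END_HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
END_HTML

for my $row (table_rows($html)) {
    my ($name, $status, $rc) = @$row;
    print "$name | $status | $rc\n";
}
```

Keeping each row as its own array ref preserves the link between a job name and its status and return code, which a flat list of every td in the document would lose.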

http://metacpan.org/pod/HTML::TreeBuilder


霸刀☆藐视天下

#3 · 2019-06-26 00:12

The HTML::Query module is a wrapper around the HTML parser that provides a querying interface that is familiar to jQuery users. So you could write something like

use HTML::Query qw(Query);
my $docName = "test.html";
my $doc = Query(file => $docName);

for my $tr ($doc->query("tr")) {
  for my $td (Query($tr)->query("td")) {
    # $td is now an HTML::Element object for the td element
    print $td->as_text, "\n";
  }
}

Read the HTML::Query documentation to get a better idea of how to use it; the above is hardly the prettiest example.


Lonely孤独者°

#4 · 2019-06-26 00:17

You could use a regular expression, but Perl already has modules built for this specific task. Check out HTML::TableContentParser.

You would probably do something like this:

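use HTML::TableContentParser;

# $HTML holds the HTML document as a string
my $tcp = HTML::TableContentParser->new;
my $tables = $tcp->parse($HTML);

foreach my $table (@$tables) {
  # rows belong to the current table, not to the list of tables
  foreach my $row (@{ $table->{rows} }) {
    # the module's synopsis stores each row's cells under the "cells" key
    foreach my $cell (@{ $row->{cells} }) {
      # each <td>
      my $data = $cell->{data};
    }
  }
}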


劫难

#5 · 2019-06-26 00:26

Here I use HTML::Parser. It is a little verbose, but it works. Because the script reads its input with the diamond operator, you can use it as a filter. If you name this Perl source extractTd, here are a couple of ways to call it.

$ extractTd test.html

$ extractTd < test.html

Both work. Output goes to standard output, which you can redirect to a file.

#!/usr/bin/perl -w

use strict;

package ExtractTd;
use 5.010;
use base "HTML::Parser";

my $td_flag = 0;

sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_; 
    if ($tag =~ /^td$/i) {
        $td_flag = 1;
    }   
}

sub end {
    my ($self, $tag, $origtext) = @_; 
    if ($tag =~ /^td$/i) {
        $td_flag = 0;
    }   
}

sub text {
    my ($self, $text) = @_; 
    if ($td_flag) {
        say $text;
    }   
}

my $extractTd = ExtractTd->new;
while (<>) {
    $extractTd->parse($_);
}
$extractTd->eof;


贼婆χ

#6 · 2019-06-26 00:32

Have you tried looking on CPAN for HTML libraries? HTML::TableExtract seems to do what you want: http://search.cpan.org/~msisk/HTML-TableExtract-2.11/lib/HTML/TableExtract.pm

Also, here is a whole page of HTML-related libraries: http://search.cpan.org/search?m=all&q=html+&s=1&n=100
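For the report in the question, a minimal HTML::TableExtract sketch might look like the following. The helper name job_rows and the abbreviated inline sample are hypothetical; the headers option, parse, tables, and rows come from the module's documented interface. It assumes the header texts match the th cells of the report, and it has not been run against the real file.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: returns one array ref [name, status, return code]
# per data row of every table whose header row matches.
sub job_rows {
    my ($html) = @_;

    # Matching on header text selects the table and fixes the column order.
    my $te = HTML::TableExtract->new(
        headers => ['Object name', 'Status', 'Return code'],
    );
    $te->parse($html);

    my @rows;
    for my $table ($te->tables) {
        push @rows, $table->rows;    # each row is an array ref of cell texts
    }
    return @rows;
}

# Abbreviated sample in the shape of the report from the question.
my $html = <<'END_HTML';
<table>
<tr><th>Object name</th><th>Status</th><th>Return code</th></tr>
<tr><td>JOBS.UNIX.ADMIN.INFA_CHK_REP_SERVICE</td><td>ENDED_OK - ended normally</td><td>0</td></tr>
</table>
END_HTML

for my $row (job_rows($html)) {
    my ($name, $status, $rc) = @$row;
    print "$name | $status | $rc\n";
}
```

Selecting by header text rather than by table position means the extraction should keep working if the reporting tool adds another table before this one.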
