How can I extract data from HTML tables in Perl?

Possible duplicate:
Can you provide an example of parsing HTML with your favorite parser?
How can I extract content from HTML files using Perl?

I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:

<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>

Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.

Any help would be greatly apreciated!

Ok sorry... I've tried many regular expressions such as:

$lines =~ / (<td>) /
       OR
$lines =~ / <td>(.*)< /
       OR
$lines =~ / >(.*)< /

My current program looks like so:

#!perl -w

open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;

open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;

my $lines = join '', <INPUT_FILE>;

print "Hello 2\n";

if ($lines =~ / (\S.*\S) /) {
print "this is 1: \n";
print $1;
    if ($lines =~ / <td>(.*)< / ) {
    print "this is the 2nd 1: \n";
    print $1;
    print "the word was: $1.\n";
    $Time = $1;
    print $Time;
    print OUTPUT_FILE $Time;
    } else {
    print "2ND IF FAILED\n";
    }
} else { 
print "THIS FAILED\n";
}

close(INPUT_FILE);
close(OUTPUT_FILE);

标签： html perl parsing

3条回答

聊天终结者

2楼-- · 2020-06-04 13:20

Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.

0人赞添加讨论(0) 举报

何必那么认真

3楼-- · 2020-06-04 13:36

Use HTML::TableExtract. Really.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TableExtract;
use LWP::Simple;

my $file = 'Table3.htm';
unless ( -e $file ) {
    my $rc = getstore(
        'http://www.ntsb.gov/aviation/Table3.htm',
        $file);
    die "Failed to download document\n" unless $rc == 200;
}

my @headers = qw( Year Fatalities );

my $te = HTML::TableExtract->new(
    headers => \@headers,
    attribs => { id => 'myTable' },
);

$te->parse_file($file);

my ($table) = $te->tables;

print join("\t", @headers), "\n";

for my $row ($te->rows ) {
    print join("\t", @$row), "\n";
}

This is what I meant in another post by "task-specific" HTML parsers.

You could have saved a lot of time by directing your energy to reading some documentation rather than throwing regexes at the wall and seeing if any stuck.

0人赞添加讨论(0) 举报

混吃等死

4楼-- · 2020-06-04 13:40

That's an easy one:

my $html = '<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>';
my @stuff = $html =~ />([^<]+)</g;
print join (", ", @stuff), "\n";

See http://codepad.org/qz9d5Bro if you want to try running it.

0人赞添加讨论(0) 举报

How can I extract data from HTML tables in Perl?

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间