Get contents between table tags in every file in di

Posted 2019-09-09 06:39

Question:

I have a directory with about 900 HTML documents in it. Each document contains the same easily identified table, and that table holds data which I need to extract and output in CSV format. What is the best way to do this, and how can I do it?

Here is an example of what is in each HTML file that I need to extract:

<table class="datalogs" cellspacing="5px">
                        <tr>< th>Data1</th><th>Data 2</th><th>Data 3</th><th>Data 4</th><th>Data 4< /th>< th>Data 5</th><th>Data 6</th></tr>
<tr class="odd"><td valign="top"><h4>123<br/></h4></td><td valign="top">AAA</td><td valign="top"><b>url here</b></td><td valign="top">Yes</td><td valign="top">None</td><td valign="top"></td><td valign="top"></td></tr><tr class="even">...
                        </table>

The ideal outcome would be "123", "AAA", "url here", "Yes", "None", "", ""

If this can't be achieved in one go, then just extracting the data between the table tags (identified by class="datalogs") and putting all the results into one file would do (this would come from a loop that goes through the directory and gets this table from every file).

Thanks for your help

Answer 1:

Doable in Perl, with the help of HTML::TableExtract and Text::CSV:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

my $te = 'HTML::TableExtract'
         ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                           'Data 4', 'Data 5', 'Data 6']);

my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

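# Parse every file named on the command line with the same extractor.
while (@ARGV) {
    my $file = shift;
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };    # slurp the whole file
    $te->parse($html);
}
# Print each row of every matched table as one CSV line.
for my $table ($te->tables) {
    $csv->print(*STDOUT{IO}, $_) for $table->rows;
}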

I had to fix some errors in your sample input (there should be no space between < and the tag name or /).
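
If the real files contain the same stray spaces as the sample, the clean-up could be done in code before parsing. The following is only a minimal sketch: normalize_tags is a hypothetical helper, not part of the original answer, and it assumes the only damage is whitespace directly after < or around the / of a closing tag.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper (an assumption, not from the original answer):
# remove stray whitespace after "<" and around the "/" of a closing tag,
# so that "< th>" becomes "<th>" and "< /th>" becomes "</th>".
# Caveat: a literal "<" followed by a letter in text content
# (e.g. "a < b") would be rewritten too.
sub normalize_tags {
    my ($html) = @_;
    $html =~ s{<\s*(/?)\s*(?=[A-Za-z])}{<$1}g;
    return $html;
}

# Demonstration on a fragment shaped like the header row in the question.
print normalize_tags('<tr>< th>Data1</th><th>Data 4< /th></tr>'), "\n";
```

In the scripts here, the call would go between reading the file and parsing it, for example $te->parse(normalize_tags($html)).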

Update

Adding the file name as the first column: a new TableExtract object is created for each file.

#!/usr/bin/perl
use warnings;
use strict;


use HTML::TableExtract;
use Text::CSV;

my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

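for my $file (@ARGV) {
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };    # slurp the whole file
    # A fresh extractor per file keeps each file's tables separate.
    my $te = 'HTML::TableExtract'
             ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                               'Data 4', 'Data 5', 'Data 6']);
    $te->parse($html);
    my ($table) = $te->tables;            # first matching table, if any
    unless ($table) {
        warn "$file: no matching table found\n";
        next;
    }
    # Prepend the file name to every row.
    $csv->print(*STDOUT{IO}, [$file, @$_]) for $table->rows;
}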
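
Both scripts read their file names from @ARGV, so the shell has to supply the list (for instance a glob over the directory, with output redirected to a CSV file). If you would rather pass the directory itself, the list could be built inside Perl. This is a rough sketch using only core modules; html_files is a hypothetical helper and the .html/.htm extension filter is an assumption about how the files are named.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use File::Spec;

# Hypothetical helper (an assumption, not from the original answer):
# return the sorted paths of all *.html / *.htm files directly inside
# $dir, without descending into subdirectories.
sub html_files {
    my ($dir) = @_;
    opendir my $DH, $dir or die "Cannot open $dir: $!";
    my @names = grep { /\.html?\z/i } readdir $DH;
    closedir $DH;
    return map { File::Spec->catfile($dir, $_) } sort @names;
}

# Use the directory given on the command line, or the current one.
my $dir = @ARGV ? $ARGV[0] : '.';
print "$_\n" for html_files($dir);
```

In the second script, the loop header would then become for my $file (html_files($dir)) instead of iterating over @ARGV.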


Tags: bash sed awk grep