I have a directory with about 900 HTML documents in it. Each document contains the same table (easily identified by its tags), and that table holds data which I need to extract and output in CSV format. What is the best way to do this, and how can I do it?
Here is an example of what is in each HTML file which I need to extract:
<table class="datalogs" cellspacing="5px">
<tr>< th>Data1</th><th>Data 2</th><th>Data 3</th><th>Data 4</th><th>Data 4< /th>< th>Data 5</th><th>Data 6</th></tr>
<tr class="odd"><td valign="top"><h4>123<br/></h4></td><td valign="top">AAA</td><td valign="top"><b>url here</b></td><td valign="top">Yes</td><td valign="top">None</td><td valign="top"></td><td valign="top"></td></tr><tr class="even">...
</table>
The ideal outcome would be
"123", "AAA", "url here", "Yes", "None", "", ""
If this can't be achieved in one go, then just extracting the data between the table tags (identified by class="datalogs") and putting all the results into one file would do (this would come from a loop which goes through the directory and gets this table from every file).
Thanks for your help
Doable in Perl, with the help of HTML::TableExtract and Text::CSV:
#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

# Match the table by its column headers.
my $te = 'HTML::TableExtract'
    ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                      'Data 4', 'Data 5', 'Data 6']);
my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

# Slurp each file named on the command line and feed it to the parser.
while (@ARGV) {
    my $file = shift;
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };
    $te->parse($html);
}

# Print every row of every matched table as one CSV line.
for my $table ($te->tables) {
    $csv->print(*STDOUT{IO}, $_) for $table->rows;
}
I had to fix some errors in your sample input (there should be no space between < and the tag name or /).
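Since the question identifies the table by class="datalogs", matching on that attribute rather than on the header text is another option. The following is only a minimal sketch: it assumes HTML::TableExtract's attribs option selects the table, and that the first extracted row is the header row and can be dropped. datalog_rows is a made-up helper name, not part of either module.

```perl
#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

# Hypothetical helper: return the data rows of the first table whose
# class attribute is "datalogs", as a list of array references.
sub datalog_rows {
    my ($html) = @_;

    # attribs restricts extraction to tables carrying these attributes.
    my $te = HTML::TableExtract->new(attribs => { class => 'datalogs' });
    $te->parse($html);

    my ($table) = $te->tables;
    return unless $table;

    my @rows = $table->rows;
    shift @rows;    # assumption: the first row is the <th> header row

    # Copy each row, turning undefined cells into '' and trimming whitespace.
    return map {
        [ map { my $c = defined $_ ? $_ : ''; $c =~ s/^\s+|\s+$//g; $c } @$_ ]
    } @rows;
}

my $csv = Text::CSV->new({ binary => 1, eol => "\n", always_quote => 1 });

for my $file (@ARGV) {
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };
    $csv->print(*STDOUT{IO}, $_) for datalog_rows($html);
}
```

Without the headers option nothing tells the module which row holds the column names, hence the explicit shift. Normalising undefined cells to empty strings is there so that always_quote also quotes the empty columns.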
Update
To add the file name as the first column, create a new TableExtract object for each file:
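#!/usr/bin/perl
use warnings;
use strict;

use HTML::TableExtract;
use Text::CSV;

my $csv = 'Text::CSV'->new({ binary       => 1,
                             eol          => "\n",
                             always_quote => 1,
                           });

for my $file (@ARGV) {
    open my $IN, '<', $file or die "$file: $!";
    my $html = do { local $/; <$IN> };

    # A fresh extractor per file, so only this file's table is seen.
    my $te = 'HTML::TableExtract'
        ->new(headers => ['Data1', 'Data 2', 'Data 3', 'Data 4',
                          'Data 4', 'Data 5', 'Data 6']);
    $te->parse($html);

    # Skip files without a matching table instead of dying on undef.
    my ($table) = $te->tables;
    unless ($table) {
        warn "$file: no matching table\n";
        next;
    }

    # Prepend the file name to every row.
    $csv->print(*STDOUT{IO}, [$file, @$_]) for $table->rows;
}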
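Both scripts take the file names from the command line. If you would rather pass the directory itself, something along these lines could fill @ARGV before the loop runs. It is a sketch using core Perl only; html_files is a made-up helper name, and it assumes the documents sit directly in that directory with names ending in .html or .htm.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: list the .html/.htm files directly inside $dir,
# sorted by name. Subdirectories are not searched.
sub html_files {
    my ($dir) = @_;
    opendir my $DH, $dir or die "$dir: $!";
    my @files = map  { "$dir/$_" }
                grep { /\.html?\z/i && -f "$dir/$_" }
                readdir $DH;
    closedir $DH;
    @files = sort @files;
    return @files;
}

# If the only argument is a directory, replace it with the files inside it.
@ARGV = html_files($ARGV[0]) if @ARGV == 1 && -d $ARGV[0];
```

Placed before the loop over @ARGV, this leaves the rest of the script unchanged, and passing individual file names still works as before.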