I have about 200 files located in the same directory, all of which contain a specific piece of content that I need to match using RegExp and either save all of the matched contents into a single array or store them in a new file.
In Notepad++'s regex search I use the following to locate the pattern:
<div class="opacity description">(.*)</div>
That is the pattern I am looking for.
And this is how I open and list all the files in the directory:
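my $d = shift // 'details';    # directory from the command line, default "details"
opendir(my $dh, $d) || die "Can't opendir $d: $!\n";
my @list = readdir($dh);       # note: also returns "." and ".."
closedir($dh);
foreach my $f (@list) {
    print "\$f = $f\n";
}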
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my ($dir) = @ARGV;
my @files = glob "$dir/*";
for my $file (@files) {
    my $tree = HTML::TreeBuilder::XPath->new_from_file($file);
    my @opacity = $tree->findnodes_as_strings('//div[@class="opacity description"]');
    print "\n$file\n";
    print " $_\n" for @opacity;
    $tree->delete;    # free the parsed tree before moving on to the next file
}
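You could do this in the shell.
If you have GNU xargs, it can run grep in parallel (-P sets the number of processes) and hand each process several files at a time (-n). This helps when you have many or very large files. For grep, -H prints the file name with each match and -P enables Perl-compatible regular expressions.
ls -1 | xargs -P 3 -n 5 grep -HP '<div class="opacity description">(.*)</div>'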
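Or with Perl:
my $dir = 'details';    # must be the directory that @list was read from
foreach my $f (@list) {
    my $path = "$dir/$f";
    next unless -f $path;    # skip ".", ".." and subdirectories
    print "\$f = $f\n";
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    local $/;    # undef, not '': slurp the whole file instead of one paragraph
    my $c = <$fh>;
    close($fh);
    # /s lets "." match newlines; ".*?" stops at the first </div>
    if ($c =~ m!<div class="opacity description">(.*?)</div>!s) {
        print "Found in $f: $1\n";
    }
}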
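To get every match from all the files into one array and then into a new file, something along these lines might work. It is only a sketch in core Perl: the sub names (extract_descriptions, slurp, collect_to_file) and the output name descriptions.txt are made up for the example, and it assumes the markup looks exactly like the pattern in the question, with no nested div inside the description.

```perl
use strict;
use warnings;

# Return every description found in one HTML string.
# /g finds all matches, /s lets "." cross newlines, and ".*?" stops at the
# first closing </div> instead of the last one.
sub extract_descriptions {
    my ($html) = @_;
    return $html =~ m{<div class="opacity description">(.*?)</div>}gs;
}

# Read a whole file into one string (slurp mode).
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!\n";
    local $/;    # undef record separator: read everything at once
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Collect the matches from every regular file in $dir into one array and
# write them to $out, each match followed by a newline.
sub collect_to_file {
    my ($dir, $out) = @_;
    my @found;
    for my $file (grep { -f } glob("$dir/*")) {
        push @found, extract_descriptions(slurp($file));
    }
    open(my $fh, '>', $out) or die "Can't write $out: $!\n";
    print {$fh} "$_\n" for @found;
    close($fh);
    return @found;
}

# Runs only when both arguments are given, e.g.:
#   perl collect.pl details descriptions.txt
collect_to_file(@ARGV) if @ARGV == 2;
```

collect_to_file also returns the array, so you can keep working with the matches in the same script instead of reading the output file back in.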
For processing HTML files, it is much safer to use a module that understands HTML and can walk the DOM tree.
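Here is a small illustration of why the regex is fragile. The sample HTML is invented for the example (a description that contains a nested div, followed by an unrelated div), so treat it as a sketch of the failure mode rather than a statement about your actual files.

```perl
use strict;
use warnings;

# The pattern from the question, once greedy and once non-greedy.
my $pattern_greedy = qr{<div class="opacity description">(.*)</div>}s;
my $pattern_lazy   = qr{<div class="opacity description">(.*?)</div>}s;

# Hypothetical input: nested <div> inside the description, then a footer <div>.
my $html = '<div class="opacity description">Intro <div>note</div> end</div>'
         . '<div>footer</div>';

my ($greedy) = $html =~ $pattern_greedy;
my ($lazy)   = $html =~ $pattern_lazy;

print "greedy: $greedy\n";    # runs on to the LAST </div> in the string
print "lazy:   $lazy\n";      # stops at the FIRST </div>, inside the nested div
```

The greedy version captures `Intro <div>note</div> end</div><div>footer` and the non-greedy one captures `Intro <div>note`. Neither is the real content `Intro <div>note</div> end`, because a regex cannot pair opening and closing tags. A parser-based approach such as the XPath one does not have this problem.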