Read file content and use Regular Expression to lo

Published 2019-09-22 06:01

Question:

I have about 200 files located in the same directory, all of which contain a specific piece of content that I need to match using RegExp and either save all of the matched contents into a single array or store them in a new file.

When working with the Notepad++ regexp engine, I use the following to locate the pattern:

<div class="opacity description">(.*)</div>

so that is the pattern I am looking for.

And this is how I open and list all the files in the directory:

my $d = shift // "details";   # directory to scan; falls back to "details"

opendir(my $dh, $d) || die "Can't opendir $d: $!\n";
my @list = readdir($dh);
closedir($dh);

foreach my $f (@list) {
  print "\$f = $f\n";
}

Answer 1:

use strict;
use warnings;

use HTML::TreeBuilder::XPath;

my ($dir) = @ARGV;

my @files = glob "$dir/*";

for my $file (@files) {
  my $tree = HTML::TreeBuilder::XPath->new_from_file($file);
  my @opacity = $tree->findnodes_as_strings('//div[@class="opacity description"]');
  print "\n$file\n";
  print "  $_\n" for @opacity;
}


Answer 2:

You could do this with shell:

With a recent xargs, you can run several grep processes in parallel (-P), each handling a batch of files (-n). This helps when the files are large or numerous.

ls -1 | xargs -P 3 -n 5 grep -HP '<div class="opacity description">(.*)</div>'

Or with Perl:

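foreach my $f (@list) {
  my $path = "$d/$f";   # readdir returns bare names; assumes $d is the directory that was listed
  next unless -f $path; # skip ".", ".." and subdirectories
  local $/;             # undef slurps the whole file ('' would select paragraph mode)
  print "\$f = $f\n";
  open(my $fh, '<', $path) or die "Can't open $path: $!\n";
  my $c = <$fh>;
  close($fh);
  if ($c =~ m!<div class="opacity description">(.*)</div>!){
    print "Found in $f\n";
  }
}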
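
The loop above only reports which files contain the pattern, while the question asks for the matched contents in a single array or a new file. A minimal sketch of that step follows. The helper names extract_descriptions and collect_from_dir are hypothetical, and the non-greedy (.*?) with the /s and /g modifiers is a choice made here, not part of the original pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return every captured description found in one HTML string.
sub extract_descriptions {
    my ($html) = @_;
    # (.*?) is non-greedy, so each match stops at the first closing </div>;
    # /s lets "." cross newlines, /g finds every occurrence.
    my @found = $html =~ m{<div class="opacity description">(.*?)</div>}gs;
    return @found;
}

# Hypothetical driver: collect matches from every regular file in a directory.
sub collect_from_dir {
    my ($dir) = @_;
    my @all;
    my @paths = grep { -f } glob "$dir/*";   # assumes $dir contains no spaces
    for my $path (@paths) {
        open(my $fh, '<', $path) or die "Can't open $path: $!\n";
        my $html = do { local $/; <$fh> };   # slurp the whole file
        close($fh);
        push @all, extract_descriptions($html);
    }
    return @all;
}

# Run only when executed as a script with a directory argument.
if (!caller && @ARGV) {
    my @descriptions = collect_from_dir($ARGV[0]);
    print "$_\n" for @descriptions;
}
```

Saved under a name such as extract.pl (a made-up file name), this could be run as perl extract.pl details > descriptions.txt to store the results in a new file.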

For processing HTML files, it is much safer to use a module that understands HTML and can walk the DOM tree.
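
To illustrate why a regular expression is fragile here, the following sketch runs the pattern against two made-up inputs. The sample strings are assumptions chosen for the demonstration and are not taken from the asker's files.

```perl
use strict;
use warnings;

my $open = '<div class="opacity description">';

# Case 1: two target divs on one line. Greedy (.*) runs to the LAST </div>,
# so the capture swallows the markup between the two divs.
my $two_on_a_line = "${open}first</div>${open}second</div>";
my ($greedy) = $two_on_a_line =~ m{\Q$open\E(.*)</div>};

# Case 2: a nested div. Non-greedy (.*?) stops at the FIRST </div>,
# which closes the inner div, so the outer content is cut short.
my $nested = "${open}intro <div>inner</div> outro</div>";
my ($lazy) = $nested =~ m{\Q$open\E(.*?)</div>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Neither quantifier handles both inputs correctly, which is the kind of case an HTML parser deals with by tracking which closing tag belongs to which element.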