Workaround Perl regexp limit?

Posted 2019-07-20 15:59

I wrote a program to extract attachments from mail folders (on GitHub), but it fails because of Perl's limit of 32767 repetitions in a regex match (here, one repetition per line). My program loads each mail message as a single string, and then tries to match each base64-encoded file as a single string.

To replicate the problem, first do this:

(dd if=/dev/urandom bs=2000 count=1000 | base64 ; echo "\n\n\n" ; dd if=/dev/urandom bs=2000 count=1000 | base64 ) >! /tmp/testfile.txt 

This creates a single 5403516-byte file that contains the base64 encoding of two files, separated by a run of blank lines. The situation in production is a little more complex, but this simpler case demonstrates the problem.

Our goal is to extract the base64 encoding of the first file. In other words, we want all consecutive lines that are 50 characters or longer and contain only base64 characters, stopping when we see the first "=" sign (the padding that marks the end of the base64 data).

/tmp/testfile.txt has 70180 lines, with the first 35088 lines representing the string we want to capture (the base64 encoding of the first file).

We now do the following in Perl:

# next 4 lines: read the entire file into a single variable 
undef $/; 
open(A,"/tmp/testfile.txt"); 
$all = <A>; 
close(A); 

# the output of base64 consists of these characters (plus "=" and 
# "\n", but those two are special cases) 
my($chars) = "[a-zA-Z0-9\+\/]"; 

# we declare a subroutine for testing 
sub foo {print STDERR length($_[0]),"\n";} 

# this is what I tried to do originally 
$all=~s/(\n($chars{50,}\=*\n)+)($chars+\=*\n)/foo("$1$3")/seg; 

The above prints "2523137", then "178467", then "2523137", then "178544" on STDERR.

In other words, it captures the first 2523137 characters of the first file, then the next 178467 characters of the first file, instead of capturing all 2701604 characters of the first file like I want. Note that 2523137 is approximately 77*32767 (and each line of /tmp/testfile.txt is 77 characters long).
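
The first figure can be reproduced exactly under one assumption: that the repeated group matched 32767 times before the regex engine gave up on it. In the pattern above, $1 holds the leading "\n" plus the repeated lines, and $3 holds one further line. A minimal sketch of that arithmetic (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $line_len    = 77;      # 76 base64 characters plus "\n", as stated in the question
my $repetitions = 32767;   # assumption: how often the (...)+ group matched

# $1 = leading "\n" + the repeated lines; $3 = one more line
my $captured = 1 + $line_len * $repetitions + $line_len;
print "$captured\n";       # 2523137
```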

@ikegami, if I understand correctly, your approach is:

$all=~s/((\n($chars{50,}\=*\n){0,20000})+)($chars+\=*\n)//seg; 

In other words, capture 20000 lines at a time (avoiding the 32767 line limit), but capture multiple bunches of 20000 lines. Is this correct?

Since the results will come out in multiple variables, I didn't pass the result to foo(), but instead printed the results to STDERR like this:

print STDERR "1 is $1\n"; 
print STDERR "2 is $2\n"; 
print STDERR "3 is $3\n"; 
print STDERR "4 is $4\n"; 
print STDERR "5 is $5\n"; 
print STDERR "6 is $6\n"; 

This yields $1 and $2 as identical 15085 line variables, $3 and $4 as non-identical one line variables, and $5 and $6 as empty.

Thus, I think I misunderstood your approach. Help?

Tags: regex perl size
1 answer
成全新的幸福
#2 · 2019-07-20 16:29

Since your base64 pieces are separated by a static string, you can set $/ to that string to split up the file much more efficiently, and then check whether each piece matches your criterion.

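use strict;
use warnings;
use autodie;

# With /m, ^ and $ anchor at every line, so this succeeds as soon as
# one line of the record consists solely of base64 characters.
my $is_base64 = qr{^[a-zA-Z0-9+/]+\n?$}m;

{
    open(my $fh, '<', "/tmp/testfile.txt");   # autodie reports any failure
    local $/ = "=\n";                         # each record ends at "=\n"

    while (my $base64 = <$fh>) {
        chomp $base64;                        # remove the trailing "=\n" separator
        _strip(\$base64);
        next unless $base64 =~ $is_base64;

        print STDERR length $base64, "\n";
    }
}

# Trim leading and trailing whitespace in place.
sub _strip {
    my $ref = shift;
    $$ref =~ s{^\s+}{};
    $$ref =~ s{\s+$}{};

    return;
}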
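
The "=\n" separator only appears when the encoded data needs padding, which is not the case when the original file's length is a multiple of 3 bytes. As a rough sketch of an alternative that does not depend on padding, the loop below reads line by line and collects the first run of lines with 50 or more base64 characters, so no regex quantifier ever has to repeat across lines. The helper name first_base64_block is hypothetical, and the 50-character threshold is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first run of base64 lines read from $fh.
# The run ends with the first short or "="-padded line (kept in the result)
# or just before the first line that is not base64 at all.
sub first_base64_block {
    my ($fh) = @_;
    local $/ = "\n";                         # read ordinary lines, whatever the caller set
    my $block = '';
    while (my $line = <$fh>) {
        if ($line =~ m{\A[A-Za-z0-9+/]{50,}\n\z}) {
            $block .= $line;                 # full-length line: keep collecting
        }
        elsif (length $block && $line =~ m{\A[A-Za-z0-9+/]+=*\n?\z}) {
            $block .= $line;                 # short or padded final line closes the run
            last;
        }
        elsif (length $block) {
            last;                            # anything else also closes the run
        }
    }
    return $block;
}
```

Called with a filehandle opened on /tmp/testfile.txt, this should return the first encoded file, first line included, and length() of the result gives a figure comparable to those printed above.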

This is also handy for splitting up mailboxes: set $/ to "\n\nFrom ".
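
A minimal sketch of that mailbox split, assuming an mbox-style file whose first message starts with "From " and whose messages are separated by a blank line. The helper name read_mbox_messages is hypothetical. Because the separator swallows the "From " that opens every message after the first, the sketch puts it back.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the messages read from an mbox-style filehandle.
sub read_mbox_messages {
    my ($fh) = @_;
    local $/ = "\n\nFrom ";                  # blank line followed by the next "From " line
    my @messages;
    while (my $msg = <$fh>) {
        chomp $msg;                          # drop the trailing separator, if present
        $msg = "From $msg" if @messages;     # restore the prefix eaten by the separator
        push @messages, $msg;
    }
    return @messages;
}
```

Note that a body line starting with "From " right after a blank line would also split a message here, which is one reason a dedicated module is the safer choice.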

But the comments suggesting that you should be doing this with a module are correct. There are a lot of mail modules on CPAN, so it can be a bit difficult to find the right one.
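
As one small instance of leaning on a module rather than hand-written code: MIME::Base64 ships with Perl and can decode an extracted block directly, since decode_base64 ignores the newlines between lines. The sketch below is only a round trip on made-up data, not a mail parser, and $bytes is a stand-in for a real attachment.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

my $bytes   = join '', map { chr($_ % 256) } 0 .. 999;   # stand-in for an attachment
my $encoded = encode_base64($bytes);    # wrapped into lines of at most 76 characters
my $decoded = decode_base64($encoded);  # characters outside the base64 alphabet are skipped
print length($decoded), "\n";           # 1000
```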
