Regex lazy quantifier vs negation class in MarkDow

2019-05-18 18:40发布

问题:

I'm looking through the MarkDown code written in Perl by John Gruber, and there is a sub called _Detab that converts tabs to spaces while preserving the indentation of the text. The line of code in question is 1314 in Markdown.pl:

$text =~ s{(.*?)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;

Wouldn't this cause unnecessary backtracking? Wouldn't the following pattern perform more efficiently?

/([^\t\n]*)\t/

Or am I missing something? Thanks.

BTW, I'm only negating \n and not \r because all line breaks are standardized to \n beforehand.

回答1:

Don't guess when you can benchmark:

use Benchmark 'cmpthese';

my $source = "\t\thello\n\t\t\tworld\n" x 100;
my $g_tab_width = 8;

my ($textU, $textN);

cmpthese(-3, {
  ungreedy => sub {
    $textU = $source;
    $textU =~ s{(.*?)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
  },

 negated => sub {
    $textN = $source;
    $textN =~ s{([^\n\t]*)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
  },
});

die "whoops" unless $textN eq $textU; # ensure they do the same thing

I find that the non-greedy version (as it appears in the Markdown source) is roughly 40% faster than the negated character class you suggest:

           Rate  negated ungreedy
negated  1204/s       --     -30%
ungreedy 1718/s      43%       --

My guess would be that matching . is more efficient than the negated character class, which makes up for the extra backtracking. More tests would be necessary to confirm that.



回答2:

You are correct. This would cause unnecessary backtracking. Yes, your pattern would be more efficient.

Most people don't really understand or think about how regexps work and/or just do things the way they've been taught. I don't know the particulars of this code or the author, but that's a very common regexp you'll see in perl code.

And, to be honest, for most use cases it doesn't really make that much of a difference.