I'm looking through the MarkDown code written in Perl by John Gruber, and there is a sub called _Detab
that converts tabs to spaces while preserving the indentation of the text. The line of code in question is 1314 in Markdown.pl:
$text =~ s{(.*?)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
Wouldn't this cause unnecessary backtracking? Wouldn't the following pattern perform more efficiently?
/([^\t\n]*)\t/
Or am I missing something? Thanks.
BTW, I'm only negating \n
and not \r
because all line breaks are standardized to \n
beforehand.
Don't guess when you can benchmark:
use Benchmark 'cmpthese';
my $source = "\t\thello\n\t\t\tworld\n" x 100;
my $g_tab_width = 8;
my ($textU, $textN);
cmpthese(-3, {
ungreedy => sub {
$textU = $source;
$textU =~ s{(.*?)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
},
negated => sub {
$textN = $source;
$textN =~ s{([^\n\t]*)\t}{$1.(' ' x ($g_tab_width - length($1) % $g_tab_width))}ge;
},
});
die "whoops" unless $textN eq $textU; # ensure they do the same thing
I find that the non-greedy version (as it appears in the Markdown source) is roughly 40% faster than the negated character class you suggest:
Rate negated ungreedy
negated 1204/s -- -30%
ungreedy 1718/s 43% --
My guess would be that matching .
is more efficient than the negated character class, which makes up for the extra backtracking. More tests would be necessary to confirm that.
You are correct. This would cause unnecessary backtracking. Yes, your pattern would be more efficient.
Most people don't really understand or think about how regexps work and/or just do things the way they've been taught. I don't know the particulars of this code or the author, but that's a very common regexp you'll see in perl code.
And, to be honest, for most use cases it doesn't really make that much of a difference.