Why is my search in BBEdit causing a “stack overfl

2019-05-14 11:29发布

问题:

I'm stumped about a "stack overflow" error--"out of stack space (application error code: 12246)--that I'm getting in BBEdit when I do a "replace all", searching for

(@article(((?!eprint|@article|@book).)*\r)*)pmid = {(.+)}((((?!eprint|@article|@book).)*\r)*(@|\r*\z))

and replacing with

\1eprinttype = {pubmed}, eprint = {\4}\5

I can use these same patterns manually, doing one-at-a-time find & replace, without any errors, even once the match no longer occurs. I can also avoid the error by working on smaller files.

I suspect that it's my inefficient and sloppy regex coding that's to blame, and would appreciate an expert's help in doing this more efficiently. I'm trying to locate all entries in a BibLaTeX bibliography that don't already have an eprint field, but which have a pmid field, and replace the pmid field with a corresponding e-print specification (using eprint and eprinttype).


Update: After some experimentation, I've found that a different approach is the only thing I can get to work. Searching for

(?(?=@article(.+\r)+eprint = {(.+\r)+}\r*)(?!)|(@article(.+\r)+)pmid = {(.+)}((.+\r)+}\r*))

and replacing with

\3eprinttype = {pubmed}, eprint = {\5}\6

does the trick. The only problem with this is the backreferences are fragile, but I can't get named backreferences to work in BBEdit.

回答1:

It's probably catastrophic backtracking caused by this last part:

.)*\r)*(@|\r*\z))

If you break that down and simplify it, you essentially have a .*, a \r*, and another \r* right next to each other. Now picture a string of \r characters at the end of your input: How should each \r be distributed? Which of those little clauses will soak up each \r character? If you have \r\r\r\r\r, you could eat all five \rs with the .* part and none at all with the \r* parts...or, you can make up any number of permutations that will still match. Since the * is greedy, it will try to fill the .* up first, but if that fails, it has to keep trying permutations until one of them works. So it's probably hogging a bunch of your resources with unnecessary backtracking, until finally it crashes.

I'm not an expert on optimization techniques for regex, but I'd start there if I were you.

Update:

Check out the Wikipedia article on PCRE:

Unless the "NoRecurse" PCRE build option (aka "--disable-stack-for-recursion") is chosen, adequate stack space must be allocated to PCRE by the calling application or operating system. ... While PCRE's documentation cautions that the "NoRecurse" build option makes PCRE slower than the alternative, using it avoids entirely the issue of stack overflows.

So I think catastrophic backtracking is a good bet here. I'd try to solve it by tweaking your regex before changing the build options on PCRE.



回答2:

Obviously this is some bug. But you could try changing the expression a bit. It's difficult to optimize the expression without knowing the requirements, but here's a guess:

(@article(?:(?:(?!eprint|@article|@book|pmid)[^\r])*+\r)*+)pmid = {([^\n\r]+)}((?:(?:(?!eprint|@article|@book)[^\r])*+\r)*(?:@|\r*\z))

Replace with:

\1eprinttype = {pubmed}, eprint = {\2}\3

BBEdit seems to use PCRE, unless it's (very) outdated the above expression should be compatible.