I'm using RegexKitLite, which in turn uses ICU as its engine. Despite the documentation, a regex like /x*/ when searching against "xxxxxxxxxxx" will match empty string. It is behaving like /x*?/ should. I would like to route around this bug when it's present, and I'm considering rewriting any unescaped * as + when a regex match returns a 0-length result. My naïve guess is that the regex with +s in placeof *s will always return a subset of the correct results. What are the unexpected consequences of this? Am I going the right way?
FWIW, ICU also offers a *+ operator, but it doesn't work either.
EDIT: I should have been clearer: this is for the search field of an interactive app. I have no control over the regex that the user enters. The broken * support appears to be a bug in ICU. I sure wish I didn't need to include that POS in my code, but it's the only game in town.
If you simply change every *
quantifier to a +
, the regex will fail to work in those instances where the *
should have matched zero occurrences. In other words, the problem will have morphed from always matching zero to never matching zero. If you ask me, it's useless either way.
However, you might be able to handle the zero-occurrences case separately, with a negative lookahead. For example, x*
could be rewritten as (?:(?!x)|x+)
. It's hideous I know, but it's the most self-contained fix I can envision at the moment. You would have to do this for possessive stars as well (*+
), but not reluctant stars (*?
).
Here it is in table form:
BEFORE AFTER
x* (?:(?!x)|x+)
x*+ (?:(?!x)|x++)
x*? x*?
More complex atoms would need to have their own parentheses preserved:
(?:xyz)* (?:(?!(?:xyz))|(?:xyz)+)
You could probably drop them inside the lookahead, but they don't hurt anything except readability, and that's a lost cause anyway. :D If the
{min,}
and
{min,max}
forms are affected too, they would get the same treatment (with the same modifications for possessive variants):
x{0,} same as x*
x{0,n} (?:(?!x)|x{1,n})
It occurs to me that conditionals--(?(condition)yes-pattern|no-pattern)
--would be a perfect fit here; unfortunately, ICU doesn't seem to support them.
I can't say where things may have gone wrong with the code in question, but I can say with confidence that this specific bug is not in the ICU library. (I'm the author of the ICU regular expression package.)
I agree with the sentiment expressed above, the thing to do is not to try to hack around the problem by tweaking the regexp pattern, but to understand what the underlying problem is. There's probably some simple mistake being made that isn't clear from the original question as posed.
Both \*
and [*]
are literal asterisks, so a naive replacement mightn't work.
In fact, don't do dynamic rewriting, it's too complicated. Try to tweak your regexes statically first.
x*
is equivalent to x{0,}
and (?:x+)?
.
Yeah, use that strategy:
(pseudo code)
if ($str =~ /x*/ && $str =~ /(x+)/) {
print "'$1'\n";
}
But the real problem is the BUG as you say. Why on earth is the basic construct of quantifiers screwed up? This is not a module you should include in your code.