I want to allow the two main wildcards ? and * to filter my data.
Here is how I'm doing it now (as I saw on many websites):
public boolean contains(String data, String filter) {
    if (data == null || data.isEmpty()) {
        return false;
    }
    String regex = filter.replace(".", "[.]")
                         .replace("?", ".")
                         .replace("*", ".*");
    return Pattern.matches(regex, data);
}
But shouldn't we escape all the other regex special chars, like | or (, etc.? And also, maybe we could preserve ? and * if they are preceded by a \? For example, something like:
filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, but ?, * and \
.replaceAll("([^\\\\]|^)\\?", "$1.") // 2. replace any ? that isn't preceded by a \ by .
.replaceAll("([^\\\\]|^)\\*", "$1.*") // 3. replace any * that isn't preceded by a \ by .*
.replaceAll("\\\\([^?*]|$)", "\\\\\\\\$1"); // 4. replace any \ that isn't followed by a ? or a * (possibly due to step 2 and 3) by \\
What do you think about it? If you agree, am I missing any other regex special char?
Edit #1 (after having taken into account dan1111's and m.buettner's advice):
// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars, but \, ? and *
regex = regex.replaceAll("([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");
What about this one?
Edit #2 (after having taken into account dan1111's advice):
// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regexps special chars (if not already escaped by user), but \, ? and *
regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");
Goal in sight?
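A note on the backslashes in the replacement string: Java's replaceAll treats backslashes specially in the replacement as well, so "\\\\$1" in source code writes out a single backslash followed by the match. With only two backslashes ("\\$1") the $ would be escaped and a literal $1 would be written instead.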
And you can avoid the ([^\\\\]|^) and the $1 in the replacement string by using a negative lookbehind:
I don't really see what you need the last step for. Wouldn't that escape the backslashes that escape your meta-characters (in turn, actually not escaping them)? Note that its replacement string, "\\\\\\\\$1", writes out two backslashes. Say your original input had th|is. Then your first replacement would make that th\|is. Then the last replacement would make that th\\|is, which matches either th-backslash or is.
You need to differentiate between how your string looks written in code (uncompiled, with twice as many backslashes) and how it looks after it was compiled (containing only half the amount of backslashes).
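Both points can be checked with a short sketch. The class and method names below (LookbehindDemo and its helpers) are made up for illustration; the patterns themselves are step 1 and step 4 from the question, plus a lookbehind version of step 2.

```java
import java.util.regex.Pattern;

final class LookbehindDemo {
    // Lookbehind version of step 2: replace every ? that is not
    // directly preceded by a backslash with a dot. No capture group needed.
    static String replaceUnescapedQuestionMarks(String filter) {
        return filter.replaceAll("(?<!\\\\)[?]", ".");
    }

    // Step 1 from the question: put one backslash in front of each
    // listed regex special char ("\\\\$1" in source = one backslash + match).
    static String escapeSpecials(String filter) {
        return filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1");
    }

    // Step 4 from the question: double every backslash that is not
    // followed by ? or *. This also doubles the backslashes added by step 1.
    static String doubleBackslashes(String s) {
        return s.replaceAll("\\\\([^?*]|$)", "\\\\\\\\$1");
    }

    // Steps 1 and 4 chained, to show how the escaping gets undone.
    static String brokenRegex(String filter) {
        return doubleBackslashes(escapeSpecials(filter));
    }

    public static void main(String[] args) {
        String regex = brokenRegex("th|is");
        System.out.println(regex);
        // The | is an alternation again, so "is" alone matches
        // while the literal text "th|is" does not.
        System.out.println(Pattern.matches(regex, "is"));
        System.out.println(Pattern.matches(regex, "th|is"));
    }
}
```

Printing the intermediate string after each step is usually the quickest way to see how many backslashes are really in the compiled string.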
You might also want to think about restricting the number of possible *. A regex like .*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*! (where ! cannot be found in the input) can take quite a while to run. The issue is called catastrophic backtracking.
Here is finally the solution I adopted (using the Apache Commons Lang library):
Many thanks to @dan1111 and @m.buettner for their precious help ;)
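For readers who want the same kind of behaviour without a library dependency, here is a minimal single-pass sketch. It is not the Apache Commons Lang based code mentioned above. The names WildcardToRegex, toRegex and matches are hypothetical, and it assumes that a backslash makes the following character literal. It also collapses consecutive * into a single .*, which reduces (but does not rule out) the backtracking cost discussed earlier.

```java
import java.util.regex.Pattern;

final class WildcardToRegex {
    // Converts a wildcard filter into a regex string in one pass.
    // Assumption: "\x" in the filter means the literal character x.
    static String toRegex(String filter) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        boolean lastWasStar = false;
        for (int i = 0; i < filter.length(); i++) {
            char c = filter.charAt(i);
            if (c == '\\' && i + 1 < filter.length()) {
                // Escaped character: keep it as literal text.
                literal.append(filter.charAt(++i));
                lastWasStar = false;
            } else if (c == '*' || c == '?') {
                flush(literal, regex);
                if (c == '?') {
                    regex.append('.');
                    lastWasStar = false;
                } else if (!lastWasStar) {
                    // Only the first * of a run is emitted.
                    regex.append(".*");
                    lastWasStar = true;
                }
            } else {
                literal.append(c);
                lastWasStar = false;
            }
        }
        flush(literal, regex);
        return regex.toString();
    }

    // Quotes the pending literal text so no regex special char leaks through.
    private static void flush(StringBuilder literal, StringBuilder regex) {
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
            literal.setLength(0);
        }
    }

    static boolean matches(String data, String filter) {
        return Pattern.matches(toRegex(filter), data);
    }
}
```

Because literal runs go through Pattern.quote, there is no list of special characters to maintain, which sidesteps the "am I missing any other special char" question.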
Try this simpler version:
This quotes the whole filter with \Q and \E, and then stops the quoting on * and ?, replacing them with their pattern equivalent (.* and .).
I tested it with
Output:
Notice this is based on Oracle's implementation of Pattern.quote; I don't know if there are other valid implementations.
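The approach described in this answer can be sketched as follows. This is a reconstruction from the description, not necessarily the answerer's exact code; the class name QuotedWildcardFilter is hypothetical, and the null/empty check is carried over from the question. It relies on Pattern.quote wrapping its argument in \Q and \E, as noted above.

```java
import java.util.regex.Pattern;

final class QuotedWildcardFilter {
    static boolean contains(String data, String filter) {
        if (data == null || data.isEmpty()) {
            return false;
        }
        // Quote everything, then close the quote around each wildcard,
        // emit its regex equivalent, and reopen the quote.
        String regex = Pattern.quote(filter)
                .replace("*", "\\E.*\\Q")
                .replace("?", "\\E.\\Q");
        return Pattern.matches(regex, data);
    }

    public static void main(String[] args) {
        System.out.println(contains("hello world", "h*d"));
        System.out.println(contains("a.c", "a.c"));
        System.out.println(contains("abc", "a.c"));
    }
}
```

Unlike the edits in the question, this version does not give \? or \* any special meaning: every ? and * in the filter acts as a wildcard.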