From wildcards to regular expressions

Posted 2020-07-16 08:19

I want to allow the two main wildcards ? and * to filter my data.

Here is how I'm doing it now (as I saw on many websites):

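import java.util.regex.Pattern;

public boolean contains(String data, String filter) {
    if (data == null || data.isEmpty()) {
        return false;
    }
    // Translate the wildcards into their regex equivalents.
    String regex = filter.replace(".", "[.]")   // literal dot
                         .replace("?", ".")     // ? = any single character
                         .replace("*", ".*");   // * = any sequence of characters
    // Pattern.matches requires the whole of data to match.
    return Pattern.matches(regex, data);
}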

But shouldn't we also escape the other regex special characters, such as | or (? And perhaps we could keep ? and * literal when they are preceded by a \. For example, something like:

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, except ?, * and \
      .replaceAll("([^\\\\]|^)\\?", "$1.")           // 2. replace any ? that isn't preceded by a \ by .
      .replaceAll("([^\\\\]|^)\\*", "$1.*")          // 3. replace any * that isn't preceded by a \ by .*
      .replaceAll("\\\\([^?*]|$)", "\\\\\\\\$1");    // 4. replace any \ that isn't followed by a ? or a * (possibly due to step 2 and 3) by \\

What do you think about it? If you agree, am I missing any other regex special char?
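
To make the concern concrete, here is a small check of the simple three-replace conversion from the top of the question against filters that contain regex metacharacters. It is only a sketch: the class name NaiveWildcardDemo, the helper names and the sample inputs are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class NaiveWildcardDemo {

    // The simple conversion from the question: only '.', '?' and '*' are handled.
    static String naiveRegex(String filter) {
        return filter.replace(".", "[.]")
                     .replace("?", ".")
                     .replace("*", ".*");
    }

    // Returns true if the converted filter is not even a valid regex.
    static boolean failsToCompile(String filter) {
        try {
            Pattern.compile(naiveRegex(filter));
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // '|' is left alone, so "a|b" becomes an alternation and matches plain "a".
        System.out.println(Pattern.matches(naiveRegex("a|b"), "a"));       // true
        // An unbalanced '(' is not escaped, so the regex does not compile.
        System.out.println(failsToCompile("f(x"));                         // true
        // The plain wildcard case still works as intended.
        System.out.println(Pattern.matches(naiveRegex("a?c*"), "abcdef")); // true
    }
}
```

So a user who types a literal | or ( either gets unexpected matches or an exception, which is why some form of escaping seems necessary.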


Edit #1 (after taking dan1111's and m.buettner's advice into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regex special chars, except \, ? and *
regex = regex.replaceAll("([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

What about this one?


Edit #2 (after taking dan1111's advice into account):

// replace any even number of backslashes by a *
regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
// reduce redundant wildcards that aren't preceded by a \
regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
// escape regex special chars (if not already escaped by the user), except \, ? and *
regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
// replace ? that aren't preceded by a \ by .
regex = regex.replaceAll("(?<!\\\\)[?]", ".");
// replace * that aren't preceded by a \ by .*
regex = regex.replaceAll("(?<!\\\\)[*]", ".*");

Goal in sight?

3 answers
男人必须洒脱
#2 · 2020-07-16 08:40

One thing about the replacement string: in Java, replaceAll treats backslashes in the replacement as escape characters as well, so the four backslashes in "\\\\$1" are what it takes to write out a single literal one. With only two ("\\$1") the dollar sign gets escaped and you end up with a literal $1.

You can, however, avoid the ([^\\\\]|^) and the $1 in the replacement string by using a negative lookbehind:

filter.replaceAll("([$|\\[\\]{}(),.+^-])", "\\\\$1") // 1. escape regex special chars, except ?, * and \
      .replaceAll("(?<!\\\\)[?]", ".")             // 2. replace any ? that isn't preceded by a \ by .
      .replaceAll("(?<!\\\\)[*]", ".*");           // 3. replace any * that isn't preceded by a \ by .*

I don't really see what you need the last step for. Wouldn't that escape the backslashes that escape your meta-characters (and so, in turn, un-escape them)? Say your original input had th|is. Then your first replacement would make that th\|is. Then the last replacement would make that th\\|is, which matches either th-backslash or is.

You need to differentiate between how your string looks written in code (uncompiled, with twice as many backslashes) and how it looks after it was compiled (containing only half the amount of backslashes).
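
The layers can be checked directly in Java. Besides the compiler halving the backslashes, replaceAll itself treats \ and $ in the replacement string as special, which the sketch below makes visible. The class and method names are made up for the example.

```java
public class ReplacementEscapeDemo {

    // Four backslashes in source = two in the string = one literal backslash
    // in the output, followed by the captured group.
    static String escapeWithFour(String s) {
        return s.replaceAll("([.|])", "\\\\$1");
    }

    // Two backslashes in source = one in the string, which escapes the '$',
    // so the output contains the literal text "$1" instead of the group.
    static String escapeWithTwo(String s) {
        return s.replaceAll("([.|])", "\\$1");
    }

    public static void main(String[] args) {
        System.out.println(escapeWithFour("th|is")); // th\|is
        System.out.println(escapeWithTwo("th|is"));  // th$1is
    }
}
```

If the replacement is meant to be taken literally, Matcher.quoteReplacement from java.util.regex can do the escaping for you.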

You might also want to think about limiting the number of * allowed in a filter. A regex like .*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*.*! (where ! cannot be found in the input) can take quite a while to run. The issue is called catastrophic backtracking.
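
One cheap mitigation, sketched here under assumptions rather than as a complete fix, is to collapse runs of * before converting, since ** matches exactly the same strings as *. The helper name collapseStars is hypothetical, and the sketch assumes that * is never backslash-escaped in the filter.

```java
public class CollapseStarsDemo {

    // Collapses every run of two or more '*' into a single '*'.
    // Assumption: '*' is never escaped; a filter containing "\*" needs extra care.
    static String collapseStars(String filter) {
        return filter.replaceAll("[*]{2,}", "*");
    }

    public static void main(String[] args) {
        System.out.println(collapseStars("a***b**c*")); // a*b*c*
    }
}
```

This does nothing for filters such as *a*a*a*a*, so a hard cap on the number of wildcards may still be worth adding on top.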

The star\"
#3 · 2020-07-16 08:52

Here is the solution I finally adopted (using the Apache Commons Lang library):

public static boolean isFiltered(String data, String filter) {
    // no filter: return true
    if (StringUtils.isBlank(filter)) {
        return true;
    }
    // a filter but no data: return false
    else if (StringUtils.isBlank(data)) {
        return false;
    }
    // both a filter and data:
    else {
        // case insensitive
        data = data.toLowerCase();
        filter = filter.toLowerCase();
        // .matches() auto-anchors, so add [*] (i.e. "containing")
        String regex = "*" + filter + "*";
        // replace any even-length run of backslashes by [*]
        regex = regex.replaceAll("(?<!\\\\)(\\\\\\\\)+(?!\\\\)", "*");
        // minimize unescaped redundant wildcards
        regex = regex.replaceAll("(?<!\\\\)[?]*[*][*?]+", "*");
        // escape unescaped regex special chars, except [\], [?] and [*]
        regex = regex.replaceAll("(?<!\\\\)([|\\[\\]{}(),.^$+-])", "\\\\$1");
        // replace unescaped [?] by [.]
        regex = regex.replaceAll("(?<!\\\\)[?]", ".");
        // replace unescaped [*] by [.*]
        regex = regex.replaceAll("(?<!\\\\)[*]", ".*");
        // return whether data matches regex or not
        return data.matches(regex);
    }
}

Many thanks to @dan1111 and @m.buettner for their precious help ;)

forever°为你锁心
#4 · 2020-07-16 08:53

Try this simpler version:

String regex = Pattern.quote(filter).replace("*", "\\E.*\\Q").replace("?", "\\E.\\Q");

This quotes the whole filter with \Q and \E, then interrupts the quoting at each * and ?, replacing them with their regex equivalents (.* and .).

I tested it with

String simplePattern = "ab*g\\Ei\\.lmn?p";
String data = "abcdefg\\Ei\\.lmnop";
String quotedPattern = Pattern.quote(simplePattern);
System.out.println(quotedPattern);
String regex = quotedPattern.replace("*", "\\E.*\\Q").replace("?", "\\E.\\Q");
System.out.println(regex);
System.out.println(data.matches(regex));

Output:

\Qab*g\E\\E\Qi\.lmn?p\E
\Qab\E.*\Qg\E\\E\Qi\.lmn\E.\Qp\E
true

Note that this relies on Oracle's implementation of Pattern.quote; I don't know if there are other valid implementations.
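
For reuse, the same idea could be wrapped in a small helper. The following is a minimal sketch, not a definitive implementation: the class and method names are made up, "contains" semantics and case-insensitivity are borrowed from the other answer, and it assumes (as above) that Pattern.quote produces \Q...\E quoting. It does not support escaping * or ? with a backslash.

```java
import java.util.regex.Pattern;

public class QuoteWildcardDemo {

    // Builds a regex from a wildcard filter: everything is quoted literally,
    // except '*' (any sequence) and '?' (any single character).
    static String toRegex(String filter) {
        return Pattern.quote(filter)
                      .replace("*", "\\E.*\\Q")
                      .replace("?", "\\E.\\Q");
    }

    // "Contains" semantics: the filter may match anywhere in data,
    // so find() is used instead of matches().
    static boolean contains(String data, String filter) {
        return Pattern.compile(toRegex(filter), Pattern.CASE_INSENSITIVE)
                      .matcher(data)
                      .find();
    }

    public static void main(String[] args) {
        // Parentheses and '|' in the filter are matched literally.
        System.out.println(contains("price (USD): 10|20", "(usd)*1?|20")); // true
        // '|' is not an alternation here, so "a|x" is not found in "abc".
        System.out.println(contains("abc", "a|x"));                        // false
    }
}
```

Using find() avoids having to wrap the filter in extra * wildcards just to get around the anchoring done by matches().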
