Java Regexp: UNGREEDY flag

Posted 2019-08-27 17:35

I'd like to port a generic text processing tool, Texy!, from PHP to Java.

This tool does ungreedy matching everywhere, using preg_match_all("/.../U"). So I am looking for a library that has some kind of UNGREEDY flag.

I know I could use the .*? syntax, but there are a great many regular expressions I would have to rewrite, and then re-check with every updated version.

I've checked

  • ORO - seems to be abandoned
  • Jakarta Regexp - no support
  • java.util.regex - no support

Is there any such library?

Thanks, Ondra

4 answers
劫难
Reply #2 · 2019-08-27 17:53

I suggest you create your own modified Java library. Simply copy the java.util.regex source into your own package.

The Sun JDK 1.6 Pattern.java class defines these constants for the three quantifier modes:

    static final int GREEDY     = 0;
    static final int LAZY       = 1;
    static final int POSSESSIVE = 2;

You'll notice that these flags are only used a couple of times, and it would be trivial to modify. Take the following example:

    case '*':
        ch = next();
        if (ch == '?') {
            next();
            return new Curly(prev, 0, MAX_REPS, LAZY);
        } else if (ch == '+') {
            next();
            return new Curly(prev, 0, MAX_REPS, POSSESSIVE);
        }
        return new Curly(prev, 0, MAX_REPS, GREEDY);

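Simply change the last line to use LAZY instead of GREEDY; the branches for the other quantifiers ('+', '?' and '{n,m}') would need the same change. Note that PHP's /U modifier also makes a trailing '?' mean greedy, so to mirror it fully the 'LAZY' branch would have to return GREEDY. Since you want a regex library that behaves like the PHP one, this might be the best way to go.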

唯我独甜
Reply #3 · 2019-08-27 18:01

About the idea of checking and rechecking all the regular expressions: are you sure that the PHP and Java libraries agree enough on syntax that you wouldn't have to do this anyway? What I'd do up front is go through them all, write some tests (input and expected output), and make sure they behave the same in both implementations. Then devise a way to run those tests automatically, and you will be covered for future upgrades and incompatibilities. You'll still need to tweak things, but at least you'll know where.
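A minimal sketch of such a test table on the Java side, assuming you record each pattern together with a sample input and the matches the PHP version produced. The class name `RegexRegression`, its helper methods, and the sample cases are hypothetical and only there for illustration; the real expected values would come from running Texy! under PHP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexRegression {
    // One recorded case: a pattern, an input, and the matches we expect to see.
    static final class Case {
        final String regex;
        final String input;
        final List<String> expected;

        Case(String regex, String input, String... expected) {
            this.regex = regex;
            this.input = input;
            this.expected = Arrays.asList(expected);
        }
    }

    // Collects every match of regex in input, roughly like preg_match_all's full-match list.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Returns one description per failing case; an empty list means every case agrees.
    static List<String> run(List<Case> cases) {
        List<String> failures = new ArrayList<>();
        for (Case c : cases) {
            List<String> actual = findAll(c.regex, c.input);
            if (!actual.equals(c.expected)) {
                failures.add(c.regex + " on \"" + c.input + "\": expected "
                        + c.expected + ", got " + actual);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Illustrative cases only; replace with values recorded from the PHP implementation.
        List<Case> cases = Arrays.asList(
                new Case("a.*?b", "aXbYb", "aXb"),
                new Case("\\d+", "10 apples, 20 pears", "10", "20"));
        List<String> failures = run(cases);
        for (String f : failures) {
            System.out.println("MISMATCH: " + f);
        }
        System.out.println(failures.isEmpty() ? "all cases agree" : failures.size() + " mismatch(es)");
    }
}
```

Running the same table after every Texy! update would show which patterns drifted, without re-reading all of them by hand.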

干净又极端
Reply #4 · 2019-08-27 18:07

You may be able to use 'com.caucho.quercus.lib.regexp.JavaRegexpModule'. Quercus is a Java implementation of PHP, and the regex library implements the PHP regex syntax and method names.

干净又极端
Reply #5 · 2019-08-27 18:08

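Update: After checking the docs I found a LAZY constant, which is another term for non-greedy. However, it only appears in the OpenJDK source of Pattern, where it is declared without `public` and describes a quantifier mode rather than a compile flag. So the call below is what I was hoping for, not something that compiles from user code:

    // Hypothetical: LAZY is not one of the public Pattern flags
    Pattern p = Pattern.compile("your regex here", LAZY);
    Matcher m = p.matcher("string to match");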

Original (deprecated) response: I honestly don't think there's one.

The whole point of the +? and *? is so that you can choose which sections to do greedily and which ones to do lazily.
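For anyone unsure what that choice looks like in java.util.regex, here is a small sketch. The class name `GreedyVsLazy` and the helper `firstMatch` are made up for the example; only `Pattern` and `Matcher` are real API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b> and <i>italic</i>";

        // Greedy: .* takes as much as it can, so the match runs to the last '>'.
        System.out.println(firstMatch("<.*>", input));   // <b>bold</b> and <i>italic</i>

        // Lazy: .*? takes as little as it can, so the match stops at the first '>'.
        System.out.println(firstMatch("<.*?>", input));  // <b>

        // Mixed in one pattern: the first group is lazy, the second is greedy.
        Matcher m = Pattern.compile("(\\d+?)(\\d*)").matcher("12345");
        if (m.matches()) {
            System.out.println(m.group(1) + " | " + m.group(2)); // 1 | 2345
        }
    }
}
```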

Greedy is the default behaviour because that's the most common use of + and * in regular expressions. In fact, I can't think of a single regex parser that does it the other way around, that is, one where lazy matching is the default and a modifier is used to make something greedy.

I know this isn't the answer you're looking for, but the only way I think you'll be able to make it work is to add the ? to your *'s and +'s. On the upside, you can use regular expressions to help determine which ones need to be changed, or even to make the changes for you, either for all of them or for just those you can describe with a pattern.
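A rough sketch of automating that rewrite. I've used a small hand-written scanner here rather than a regex, because escapes and character classes are awkward to skip with a single pattern. The class `UngreedyRewriter` and its method `invertGreediness` are hypothetical names. It mimics what /U does in PHP by flipping each quantifier: greedy becomes lazy, lazy becomes greedy, and possessive is left alone. It is deliberately simplified and assumes the pattern has no \Q...\E quoting, no comments mode, and no literal ']' at the start of a character class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UngreedyRewriter {
    // Matches a counted quantifier such as {3}, {2,} or {2,5}.
    private static final Pattern COUNTED = Pattern.compile("\\{\\d+(,\\d*)?\\}");

    static String invertGreediness(String regex) {
        StringBuilder out = new StringBuilder();
        int n = regex.length();
        int classDepth = 0; // > 0 while inside [...], where * + ? are literals
        int i = 0;
        while (i < n) {
            char c = regex.charAt(i);
            if (c == '\\') {
                // Copy the escape verbatim, including the braces of \p{..}, \P{..}, \x{..}, \N{..}.
                int end = Math.min(i + 2, n);
                if (end < n && regex.charAt(end) == '{' && "pPxN".indexOf(regex.charAt(i + 1)) >= 0) {
                    int close = regex.indexOf('}', end);
                    if (close >= 0) end = close + 1;
                }
                out.append(regex, i, end);
                i = end;
            } else if (c == '[') {
                classDepth++;
                out.append(c);
                i++;
            } else if (c == ']' && classDepth > 0) {
                classDepth--;
                out.append(c);
                i++;
            } else if (classDepth > 0) {
                out.append(c);
                i++;
            } else if (c == '(') {
                // The '?' in "(?:", "(?=", "(?<name>" and friends is group syntax, not a quantifier.
                out.append(c);
                i++;
                if (i < n && regex.charAt(i) == '?') {
                    out.append('?');
                    i++;
                }
            } else {
                int qEnd = quantifierEnd(regex, i);
                if (qEnd < 0) {
                    out.append(c);
                    i++;
                    continue;
                }
                out.append(regex, i, qEnd);
                i = qEnd;
                if (i < n && regex.charAt(i) == '?') {
                    i++;                 // was lazy: drop the '?' to make it greedy
                } else if (i < n && regex.charAt(i) == '+') {
                    out.append('+');     // possessive: keep as written
                    i++;
                } else {
                    out.append('?');     // was greedy: make it lazy
                }
            }
        }
        return out.toString();
    }

    // Returns the index just past the quantifier starting at i, or -1 if there is none.
    private static int quantifierEnd(String regex, int i) {
        char c = regex.charAt(i);
        if (c == '*' || c == '+' || c == '?') return i + 1;
        if (c == '{') {
            Matcher m = COUNTED.matcher(regex).region(i, regex.length());
            if (m.lookingAt()) return m.end();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(invertGreediness("<(.+)>"));                  // <(.+?)>
        System.out.println(invertGreediness("a.*?b"));                   // a.*b
        System.out.println(invertGreediness("(?:ab)+[*+?]\\*x{2,3}"));   // (?:ab)+?[*+?]\*x{2,3}?
    }
}
```

You could run the patterns through something like this once at load time, e.g. `Pattern.compile(UngreedyRewriter.invertGreediness(phpPattern))`, so the original PHP patterns stay untouched across Texy! updates. Other PHP/Java syntax differences would still need the tests suggested in another answer.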
