Regex to find tokens - Java Scanner or another alt

Posted 2019-06-05 18:35

Hi, I'm trying to write a class that converts some text into well-defined tokens.

The strings are somewhat similar to code, like: (brown) "fox" 'c';. What I would like to get (either as tokens from a Scanner or as an array after splitting; I think both would work just fine) is ( , brown , ) , "fox" , 'c' , ; separately, as they are potential tokens. The tokens include:

  • quoted text with ' and "
  • number with or without a decimal point
  • parentheses, braces, semicolon, equals, sharp, ||, <=, &&

Currently I'm doing it with a Scanner. I had some problems with the delimiter not giving me ( ) etc. separately, so I used the following delimiter: \s+|(?=[;\{\}\(\)]|\b). The problem now is that I get " and ' as separate tokens as well, and I'd really like to avoid that. I've tried adding some negative lookaheads for variations of ", but no luck.

I've also tried using StreamTokenizer, but it does not keep the different quotes.

P.S. I did search the site and tried to google it but even though there are many Scanner related/Regex related questions, I couldn't find something that will solve my problem.

EDIT 1: So far I've come up with \s+|^|(?=[;{}()])|(?<![.\-/'"])(?=\b)(?![.\-/'"]). I might not have been clear enough, but when I have something like:

"foo";'bar')(;{

gray fox=-56565.4546;

foo boo="hello"{

I'd like to get:

"foo" ,; ,'bar',) , (,; ,{

gray,fox,=,-56565.4546,;

foo,boo,=,"hello",{

But instead I have:

"foo" ,;'bar',) , (,; ,{

gray,fox,=-56565.4546,;

foo,boo,="hello",{

Note that when there are spaces between the = and the rest, e.g. gray fox = -56565.4546;, I get the result I want:

gray,fox,=,-56565.4546,;

This is how I'm using the regex mentioned above:

Scanner scanner = new Scanner(line);
scanner.useDelimiter(MY_MENTIONED_REGEX_HERE);
while (scanner.hasNext()) {
    System.out.println("Got: `" + scanner.next() + "`");
    // Some work here
}

4 Answers
狗以群分
#2 · 2019-06-05 18:49

Your problem is largely that you are trying to do too much with one regular expression, and consequently you cannot follow the interactions between its parts. As humans, we all have this trouble.

What you are doing has a standard treatment in the compiler business, called "lexing". A lexer generator accepts a regular expression for each individual token of interest to you, and builds a complex set of states that will pick out the individual lexemes, if they are distinguishable. Separate lexical definitions per token make them easy and unconfusing to write individually. The lexer generator makes it "easy" and efficient to recognize all the members. (If you want to define a lexeme that has specific quotes included, it is easy to do that.)

See any of the widely available parser generators; they all include lexing engines, e.g., JCup, ANTLR, JavaCC, ...
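To give a feel for the "one small regex per token" idea without installing a generator, here is a minimal sketch using plain java.util.regex with named groups. It is not what JCup, ANTLR or JavaCC would generate. The class name SimpleLexer, the method lex and the token kind names (STRING, NUMBER, OPERATOR, PUNCT, WORD) are my own assumptions for illustration, and a minus sign directly before digits is always treated as part of the number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: each token kind gets its own small, readable regex.
class SimpleLexer {
    // The alternatives are tried left to right, so STRING wins over WORD, and so on.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<STRING>\"[^\"]*\"|'[^']*')"       // quoted text, quotes kept
        + "|(?<NUMBER>-?\\d+(?:\\.\\d+)?)"      // number with or without decimal point
        + "|(?<OPERATOR>\\|\\||<=|&&|[=#])"     // ||, <=, &&, =, #
        + "|(?<PUNCT>[(){};])"                  // parentheses, braces, semicolon
        + "|(?<WORD>\\w+)");                    // identifiers

    private static final String[] KINDS = {"STRING", "NUMBER", "OPERATOR", "PUNCT", "WORD"};

    // Returns tokens as "KIND:text"; characters that match no kind (e.g. spaces) are skipped.
    static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            for (String kind : KINDS) {
                if (m.group(kind) != null) {
                    out.add(kind + ":" + m.group(kind));
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : lex("gray fox=-56565.4546;")) {
            System.out.println(token);
        }
    }
}
```

A real lexer generator would also report unmatched input and pick the longest match across all rules; this sketch does neither.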

Luminary・发光体
#3 · 2019-06-05 18:50

Description

Since you are looking for all alphanumeric text which might include a decimal point, why not just "ignore" the delimiters? The following regex pulls every alphanumeric chunk (optionally containing a decimal point) out of your input string. It works on your sample text, which was:

"foo";'bar')(;{
gray fox=-56565.4546;
foo boo="hello"{

Regex: (?:(["']?)[-]?[a-z0-9-.]*\1|(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))|(?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)))


Summary

The regex has three paths which are:

  1. (["']?)[-]?[a-z0-9-.]*\1 captures an optional opening quote, then an optional minus sign, then a run of letters, digits, hyphens and dots, and finally requires the same quote again as the closing quote. This captures any text or number with a decimal point. Numbers are not validated, so 12.32.1 would match. If your input text also contains numbers prefixed with a plus sign, change [-] to [+-].
  2. (?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$)) if the previous character is a symbol, the current character is a symbol, and the next character is also a symbol or the end of the string, then grab the current symbol. This captures any free-floating symbols which are not quotes, or multiple symbols in a row like )(;{.
  3. (?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)) if the current character is not an alphanumeric or a quote, then look behind for an alphanumeric or quote symbol and look ahead for a non-alphanumeric, a quote, or the end of the line. This captures any symbol after a quote which would not be captured by the previous expressions, like the { after "hello".

Full Explanation

  • (?: start a non-capturing group. Inside this group each alternative is separated by an or | character
    1. 1st alternative: (["']?)[-]?[a-z0-9-.]*\1
      • 1st Capturing group (["']?)
      • Char class ["'] 0 or 1 times, matches one of the following chars: "'
      • Char class [-] 0 or 1 times, matches one of the following chars: -
      • Char class [a-z0-9-.] 0 or more times, matches one of the following chars: a-z0-9-.
      • \1 Matches text saved in BackRef 1
    2. 2nd alternative: (?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))
      • (?<=[^a-z0-9]) Positive LookBehind
      • Negated char class [^a-z0-9] matches any char except: a-z0-9
      • Negated char class [^a-z0-9] matches any char except: a-z0-9
      • (?=(?:[^a-z0-9]|$)) Positive LookAhead, each sub alternative is separated by an or | character
      • Group (?:[^a-z0-9]|$)
      • 1st alternative: [^a-z0-9]
      • Negated char class [^a-z0-9] matches any char except: a-z0-9
      • 2nd alternative: $End of string
    3. 3rd alternative: (?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$))
      • (?<=[a-z0-9"']) Positive LookBehind
      • Char class [a-z0-9"'] matches one of the following chars: a-z0-9"'
      • Negated char class [^a-z0-9"'] matches any char except: a-z0-9"'
      • (?=(?:[^a-z0-9]|['"]|$)) Positive LookAhead, each sub alternative is separated by an or | character
      • Group (?:[^a-z0-9]|['"]|$)
      • 1st alternative: [^a-z0-9]
      • Negated char class [^a-z0-9] matches any char except: a-z0-9
      • 2nd alternative: ['"]
      • Char class ['"] matches one of the following chars: '"
      • 3rd alternative: $End of string
  • ) end the non-capturing group

Groups

Group 0 gets the entire matched string, whereas group 1 gets the quote delimiter if it exists to ensure it'll match a close quote.

Java Code Example:

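Note that some of the empty values in the array come from the newline characters, and some are introduced by the expression itself, because its first alternative can match the empty string. You can apply the expression and some basic logic to ensure your output array only has non-empty values. The array shown after the code is in PHP print_r format, so it appears to come from a regex tester rather than from this Java program; the Java output may differ.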

import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1{
  public static void main(String[] asd){
    // A Java string literal cannot span lines, so the newlines are written as \n
    String sourcestring = "\"foo\";'bar')(;{\n"
        + "gray fox=-56565.4546;\n"
        + "foo boo=\"hello\"{";
    // Inside the Java literal, double quotes are escaped and the backreference \1 becomes \\1
    Pattern re = Pattern.compile("(?:([\"']?)[-]?[a-z0-9-.]*\\1|(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))|(?<=[a-z0-9\"'])[^a-z0-9\"'](?=(?:[^a-z0-9]|['\"]|$)))",Pattern.CASE_INSENSITIVE);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()){
      // Group 0 is the whole match, group 1 is the quote character (if any)
      for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

 $matches Array:
(
    [0] => Array
        (
            [0] => "foo"
            [1] => 
            [2] => ;
            [3] => 'bar'
            [4] => 
            [5] => )
            [6] => 
            [7] => (
            [8] => 
            [9] => ;
            [10] => 
            [11] => {
            [12] => 
            [13] => 
            [14] => 
            [15] => gray
            [16] => 
            [17] => fox
            [18] => 
            [19] => =
            [20] => -56565.4546
            [21] => 
            [22] => ;
            [23] => 
            [24] => 
            [25] => 
            [26] => foo
            [27] => 
            [28] => boo
            [29] => 
            [30] => =
            [31] => "hello"
            [32] => 
            [33] => {
            [34] => 
        )

    [1] => Array
        (
            [0] => "
            [1] => 
            [2] => 
            [3] => '
            [4] => 
            [5] => 
            [6] => 
            [7] => 
            [8] => 
            [9] => 
            [10] => 
            [11] => 
            [12] => 
            [13] => 
            [14] => 
            [15] => 
            [16] => 
            [17] => 
            [18] => 
            [19] => 
            [20] => 
            [21] => 
            [22] => 
            [23] => 
            [24] => 
            [25] => 
            [26] => 
            [27] => 
            [28] => 
            [29] => 
            [30] => 
            [31] => "
            [32] => 
            [33] => 
            [34] => 
        )

)
We Are One
#4 · 2019-06-05 18:56

Perhaps a scanner generator such as JFlex would make it easier to achieve your goal than a regular expression.

Even if you prefer to write the code by hand, I think it would be better to structure it somewhat more. One simple solution would be to create separate methods which try to "consume" from your text the different types of tokens that you want to recognize. Each such method could tell whether it succeeded or not. This way you have several smaller chunks of code, each responsible for one kind of token, instead of just one big piece of code which is harder to understand and to write.
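As a rough illustration of the "consume" approach, here is a minimal hand-written sketch. All names in it (HandLexer, tokenize, consumeQuoted, consumeNumber, consumeOperator, consumeWord) are hypothetical, and the token rules are my reading of the question: quoted text keeps its quotes, a minus sign directly before digits belongs to the number, and any other single character becomes its own token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one small "consume" method per token type.
class HandLexer {
    private final String text;
    private int pos = 0;
    private final List<String> tokens = new ArrayList<>();

    private HandLexer(String text) { this.text = text; }

    static List<String> tokenize(String text) {
        HandLexer lx = new HandLexer(text);
        while (lx.pos < text.length()) {
            if (Character.isWhitespace(text.charAt(lx.pos))) { lx.pos++; continue; }
            // Try each token type in turn; each method reports whether it succeeded.
            if (lx.consumeQuoted() || lx.consumeNumber()
                    || lx.consumeOperator() || lx.consumeWord()) continue;
            lx.emit(lx.pos + 1); // fallback: a single character such as ( ) { } ; = #
        }
        return lx.tokens;
    }

    // Records text[pos, end) as a token and advances the position.
    private boolean emit(int end) {
        tokens.add(text.substring(pos, end));
        pos = end;
        return true;
    }

    private boolean consumeQuoted() {
        char q = text.charAt(pos);
        if (q != '"' && q != '\'') return false;
        int close = text.indexOf(q, pos + 1);
        if (close < 0) return false; // unterminated: the fallback emits the lone quote
        return emit(close + 1);
    }

    private boolean consumeNumber() {
        int i = pos;
        if (text.charAt(i) == '-') i++;
        int digitsStart = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
        if (i == digitsStart) return false; // no digits, so not a number
        if (i + 1 < text.length() && text.charAt(i) == '.'
                && Character.isDigit(text.charAt(i + 1))) {
            i++; // skip the decimal point
            while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
        }
        return emit(i);
    }

    private boolean consumeOperator() {
        for (String op : new String[] {"||", "<=", "&&"}) {
            if (text.startsWith(op, pos)) return emit(pos + op.length());
        }
        return false;
    }

    private boolean consumeWord() {
        int i = pos;
        while (i < text.length()
                && (Character.isLetterOrDigit(text.charAt(i)) || text.charAt(i) == '_')) i++;
        return i > pos && emit(i);
    }
}
```

Calling HandLexer.tokenize("foo boo=\"hello\"{") should give foo, boo, =, "hello", { as separate tokens. Adding a new token type means adding one more small method and one more entry in the chain, which is the point of structuring it this way.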

在下西门庆
#5 · 2019-06-05 19:00

The idea is to go from the particular cases to the general one. Try this expression:

Java string:
"([\"'])(?:[^\"']+|(?!\\1)[\"'])*\\1|\\|\\||<=|&&|[()\\[\\]{};=#]|[\\w.-]+"

Raw pattern:
(["'])(?:[^"']+|(?!\1)["'])*\1|\|\||<=|&&|[()\[\]{};=#]|[\w.-]+

The goal here isn't to split on a hypothetical delimiter, but to match entity by entity. Note that the order of the alternatives defines the priority (for example, you can't put = before =>).
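To see why the order matters, here is a small sketch comparing the two orderings. The => operator is only the illustration used in the remark above, not one of the tokens from the question, and the class name AlternationOrder and the helper matches are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: the regex engine takes the first alternative that matches.
class AlternationOrder {
    // Collects every match of regex in input, in order.
    static List<String> matches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Longer operator first: => is found as one token.
        System.out.println(matches("=>|=", "a=>b")); // prints [=>]
        // Shorter operator first: = wins and the > is never matched.
        System.out.println(matches("=|=>", "a=>b")); // prints [=]
    }
}
```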

Example with your new specifications (you need to import java.util.regex.Pattern and java.util.regex.Matcher):

String s = "(brown) \"fox\" 'c';foo bar || 55.555;\"foo\";'bar')(;{ gray fox=-56565.4546; foo boo=\"hello\"{";
Pattern p = Pattern.compile("([\"'])(?:[^\"']+|(?!\\1)[\"'])*\\1|\\|\\||<=|&&|[()\\[\\]{};=#]|[\\w.-]+");
Matcher m = p.matcher(s);

while (m.find()) {
    System.out.println("item = `" + m.group() + "`");
}