Hi, I'm trying to write a class that converts some text into well-defined tokens.
The strings are somewhat similar to code, like: (brown) "fox" 'c';
What I would like to get (either as tokens from a Scanner or as an array after splitting; I think both would work just fine) is ( , brown , ) , "fox" , 'c' , ; separately, as they are potential tokens. The token types include:
- quoted text with ' and "
- numbers with or without a decimal point
- parentheses, braces, semicolon, equals, sharp, ||, <=, &&
Currently I'm doing it with a Scanner. I had some problems with the delimiter not giving me ( ) etc. separately, so I've used the following delimiter: \s+|(?=[;\{\}\(\)]|\b)
The problem now is that I get " and ' as separate tokens as well, and I'd really like to avoid that. I've tried adding some negative lookaheads for variations of " but no luck.
I've tried using StreamTokenizer, but it does not keep the different quotes.
P.S. I did search the site and tried to google it but even though there are many Scanner related/Regex related questions, I couldn't find something that will solve my problem.
EDIT 1:
So far I came up with \s+|^|(?=[;{}()])|(?<![.\-/'"])(?=\b)(?![.\-/'"])
I might not have been clear enough, but when I have something like:
"foo";'bar')(;{
gray fox=-56565.4546;
foo boo="hello"{
I'd like to get:
"foo"
,;
,'bar'
,)
, (
,;
,{
gray
,fox
,=
,-56565.4546
,;
foo
,boo
,=
,"hello"
,{
But instead I have:
"foo"
,;'bar'
,)
, (
,;
,{
gray
,fox
,=-56565.4546
,;
foo
,boo
,="hello"
,{
Note that when there are spaces between the = and the rest, e.g. gray fox = -56565.4546; the output is:
gray
,fox
,=
,-56565.4546
,;
What I'm doing with the above-mentioned regex is:
    Scanner scanner = new Scanner(line);
    scanner.useDelimiter(MY_MENTIONED_REGEX_HERE);
    while (scanner.hasNext()) {
        System.out.println("Got: `" + scanner.next() + "`");
        // Some work here
    }
Your problem is largely that you are trying to do too much with one regular expression, and consequently you cannot understand the interactions of the parts. As humans we all have this trouble.
What you are doing has a standard treatment in the compiler business, called "lexing". A lexer generator accepts a regular expression for each individual token of interest to you, and builds a complex set of states that will pick out the individual lexemes, if they are distinguishable. Separate lexical definitions per token make them easy and unconfusing to write individually. The lexer generator makes it "easy" and efficient to recognize all the members. (If you want to define a lexeme that has specific quotes included, it is easy to do that.)
See any of the widely available parser generators; they all include lexing engines, e.g., JCup, ANTLR, JavaCC, ...
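To make the per-token idea concrete without pulling in a generator, here is a minimal hand-rolled sketch in Java. The class name SimpleLexer, the token kind names, and the individual patterns are my own assumptions based on the token list in the question. A real lexer generator would compile such rules into a state machine rather than trying them one by one, so treat this as an illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-rolled lexer: one small regex per token kind.
class SimpleLexer {
    // Kind names and patterns are assumptions derived from the question's token list.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("STRING", Pattern.compile("\"[^\"]*\"|'[^']*'"));
        RULES.put("NUMBER", Pattern.compile("-?\\d+(\\.\\d+)?"));
        RULES.put("WORD", Pattern.compile("[A-Za-z_]\\w*"));
        RULES.put("OPERATOR", Pattern.compile("\\|\\||&&|<=|[=#]"));
        RULES.put("PUNCT", Pattern.compile("[(){};]"));
    }

    // Returns tokens formatted as KIND:text, skipping whitespace between them.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++;
                continue;
            }
            String bestKind = null;
            int bestEnd = pos;
            // Try every rule at the current position and keep the longest match;
            // on a tie, the rule declared first wins.
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestKind = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("No token matches at index " + pos);
            }
            tokens.add(bestKind + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }
}
```

With these assumed rules, SimpleLexer.tokenize("gray fox=-56565.4546;") yields WORD:gray, WORD:fox, OPERATOR:=, NUMBER:-56565.4546, PUNCT:; whether or not there are spaces around the =. Note that the sketch always reads a leading - as part of a number, which would be wrong for something like a-5 if subtraction existed in your input.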
Description
Since you are looking for all alphanumeric text which might include a decimal point, why not just "ignore" the delimiters? The following regex will pull all the alphanumeric with decimal point chunks from your input string. This works because your sample text was:
Regex:
(?:(["']?)[-]?[a-z0-9-.]*\1|(?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))|(?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)))
Summary
The regex has three paths, which are:
- (["']?)[-]?[a-z0-9-.]*\1 captures an open quote, followed by a minus sign if it exists, followed by some text or numbers; this continues until it reaches the close quote. This captures any text or numbers with a decimal point. The numbers are not validated, so 12.32.1 would match. If your input text also contained numbers prefixed with a plus sign, then change [-] to [+-].
- (?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$)) looks behind for a non-alphanumeric: if the previous character is a symbol, this character is a symbol, and the next character is also a symbol or the end of the string, then it grabs the current symbol. This captures any free-floating symbols which are not quotes, or multiple symbols in a row like )(;{.
- (?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$)) applies if the current character is not an alphanumeric or a quote: it looks behind for an alphanumeric or quote symbol and looks ahead for a non-alphanumeric, a quote, or the end of the line. This captures any symbols after a quote which would not be captured by the previous expressions, like the { after "Hello".
Full Explanation
The whole expression is a non-capturing group (?:...); each sub alternative is separated by an or | character.
- First alternative: (["']?)[-]?[a-z0-9-.]*\1
  - (["']?) capture group 1: ["'] 1 to 0 times, matches one of the following chars: "'
  - [-] 1 to 0 times, matches one of the following chars: -
  - [a-z0-9-.] infinite to 0 times, matches one of the following chars: a-z0-9-.
  - \1 matches text saved in BackRef 1
- Second alternative: (?<=[^a-z0-9])[^a-z0-9](?=(?:[^a-z0-9]|$))
  - (?<=[^a-z0-9]) positive lookbehind: [^a-z0-9] matches any char except: a-z0-9
  - [^a-z0-9] matches any char except: a-z0-9
  - (?=(?:[^a-z0-9]|$)) positive lookahead; each sub alternative of (?:[^a-z0-9]|$) is separated by an or | character
    - [^a-z0-9] matches any char except: a-z0-9
- Third alternative: (?<=[a-z0-9"'])[^a-z0-9"'](?=(?:[^a-z0-9]|['"]|$))
  - (?<=[a-z0-9"']) positive lookbehind: [a-z0-9"'] matches one of the following chars: a-z0-9"'
  - [^a-z0-9"'] matches any char except: a-z0-9"'
  - (?=(?:[^a-z0-9]|['"]|$)) positive lookahead; each sub alternative of (?:[^a-z0-9]|['"]|$) is separated by an or | character
    - [^a-z0-9] matches any char except: a-z0-9
    - ['"] matches one of the following chars: '"
- ) ends the non-capturing group
Groups
Group 0 gets the entire matched string, whereas group 1 gets the quote delimiter if it exists to ensure it'll match a close quote.
Java Code Example:
Note some of the empty values in the array are from the new line character, and some are introduced from the expression. You can apply the expression and some basic logic to ensure your output array only has non empty values.
Perhaps it will be easier to achieve your goal with a scanner generator such as JFlex than with a regular expression.
Even if you prefer to write the code by hand, I think it would be better to structure it somewhat more. One simple solution would be to create separate methods which try to "consume" from your text the different types of tokens that you want to recognize. Each such method could tell whether it succeeded or not. This way you have several smaller chunks of code, each responsible for one kind of token, instead of just one big piece of code which is harder to understand and to write.
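A rough sketch of what such "consume" methods could look like in Java follows. Everything here is hypothetical: the class name ConsumeTokenizer, the method names, and the exact rules for numbers and words are assumptions I made from the token list in the question. Each method returns the token it consumed, or null to signal that it did not succeed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written tokenizer: one small "consume" method per token kind.
class ConsumeTokenizer {
    // Operators and symbols taken from the question's list (an assumption).
    private static final String[] TWO_CHAR = {"||", "<=", "&&"};
    private static final String ONE_CHAR = "(){};=#";

    private final String text;
    private int pos = 0;

    ConsumeTokenizer(String text) {
        this.text = text;
    }

    List<String> tokenize() {
        List<String> out = new ArrayList<>();
        while (true) {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) return out;
            // Try each kind in turn; the first method that succeeds wins.
            String tok = consumeQuoted();
            if (tok == null) tok = consumeNumber();
            if (tok == null) tok = consumeWord();
            if (tok == null) tok = consumeSymbol();
            if (tok == null) throw new IllegalArgumentException("Unexpected input at index " + pos);
            out.add(tok);
        }
    }

    // Quoted text including its quotes; null if no complete quoted text starts here.
    private String consumeQuoted() {
        char q = text.charAt(pos);
        if (q != '"' && q != '\'') return null;
        int close = text.indexOf(q, pos + 1);
        if (close < 0) return null;
        return take(close + 1);
    }

    // Optional minus sign, digits, and an optional fraction part.
    private String consumeNumber() {
        int i = pos;
        if (text.charAt(i) == '-') i++;
        int digitsStart = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
        if (i == digitsStart) return null;
        if (i + 1 < text.length() && text.charAt(i) == '.' && Character.isDigit(text.charAt(i + 1))) {
            i++;
            while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
        }
        return take(i);
    }

    // A letter followed by letters or digits.
    private String consumeWord() {
        if (!Character.isLetter(text.charAt(pos))) return null;
        int i = pos + 1;
        while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
        return take(i);
    }

    // Two-character operators are checked before single-character symbols.
    private String consumeSymbol() {
        for (String op : TWO_CHAR) {
            if (text.startsWith(op, pos)) return take(pos + 2);
        }
        if (ONE_CHAR.indexOf(text.charAt(pos)) >= 0) return take(pos + 1);
        return null;
    }

    // Returns text[pos, end) and advances the position to end.
    private String take(int end) {
        String tok = text.substring(pos, end);
        pos = end;
        return tok;
    }
}
```

Usage would be along the lines of new ConsumeTokenizer(line).tokenize(). The point of the structure is that a fix for, say, quoted text touches only consumeQuoted and cannot disturb how numbers or symbols are recognized.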
The idea is to go from particular cases to general ones. Try this expression:
The goal here isn't to split with a hypothetical delimiter, but to match entity by entity. Note that the order of the alternatives defines the priority (you can't put = before =>).
Example with your new specifications (you need to import Pattern & Matcher):
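As an illustration of matching entity by entity, here is a minimal sketch. The pattern is my own assumption, written for the token list in the question, and is not necessarily the expression this answer refers to. The class name OrderedMatcher is hypothetical as well. The alternatives run from the most specific (quoted text, numbers) to the most general (single symbols), and the two-character operators come before the single-character ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: match token by token instead of splitting on a delimiter.
class OrderedMatcher {
    // Order matters: an earlier alternative takes priority over a later one.
    private static final Pattern TOKEN = Pattern.compile(
          "\"[^\"]*\""            // double-quoted text, quotes included
        + "|'[^']*'"              // single-quoted text, quotes included
        + "|-?\\d+(?:\\.\\d+)?"   // number with optional sign and fraction
        + "|\\w+"                 // word
        + "|\\|\\||&&|<="         // two-character operators first
        + "|[(){};=#<]"           // then single-character symbols
    );

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // find() moves on to the next match; whitespace matches no alternative
        // and is therefore skipped.
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}
```

Applied to gray fox=-56565.4546; this gives gray, fox, =, -56565.4546, ; as separate tokens. One caveat of this approach is that find() silently skips any character that no alternative matches, so unexpected input is dropped rather than reported.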