Basically, I'm being passed a string and I need to tokenise it in much the same manner as command-line options are tokenised by a *nix shell.
Say I have the following string:
"Hello\" World" "Hello Universe" Hi
How could I turn it into a 3-element list?
- Hello" World
- Hello Universe
- Hi
The following is my first attempt, but it's got a number of problems
- It leaves the quote characters
- It doesn't catch the escaped quote
Code:
public void test() {
    String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
    List<String> list = split(str);
}

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile(
        "\"[^\"]*\"" + /* double quoted token */
        "|'[^']*'" + /* single quoted token */
        "|[A-Za-z']+" /* everything else */
    );
    List<String> opts = new ArrayList<String>();
    Scanner scanner = new Scanner(str).useDelimiter(pattern);
    String token;
    while ((token = scanner.findInLine(pattern)) != null) {
        opts.add(token);
    }
    return opts;
}
So the incorrect output of the above code is
- "Hello\"
- World
- " "
- Hello
- Universe
- Hi
EDIT: I'm totally open to a non-regex solution. Regex was just the first approach that came to mind.
If you decide you want to forgo regex and do parsing instead, there are a couple of options. If you are willing to accept just a double quote or a single quote (but not both) as your quote character, then you can use StreamTokenizer to solve this easily:
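The answer's original snippet is not preserved here, so the following is only a minimal sketch of how StreamTokenizer could be configured for this, assuming the double quote is the only quote character. The class name `StreamTokenizerSplit` and the method `split` are made up for illustration. The default syntax table is reset first because, out of the box, StreamTokenizer parses numbers and treats `/` as a comment character.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the original answer.
public class StreamTokenizerSplit {

    public static List<String> split(String str) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(str));
        st.resetSyntax();            // drop number parsing and '/' comments
        st.wordChars(33, 255);       // every visible character is a word character...
        st.whitespaceChars(0, ' ');  // ...control characters and space separate tokens...
        st.quoteChar('"');           // ...and '"' delimits quoted strings (assumption: only this quote)

        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                // sval holds the text for both plain words and quoted strings;
                // for quoted strings the quotes are removed and \" is unescaped.
                if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"') {
                    tokens.add(st.sval);
                }
            }
        } catch (IOException e) {
            // Should not happen with a StringReader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"Hello\\\" World\" \"Hello Universe\" Hi"));
    }
}
```

Note that StreamTokenizer also interprets other backslash escapes inside quoted strings (for example `\n` and octal codes), which may or may not be what you want.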
If you must support both quotes, here is a naive implementation that should work. One caveat: a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes:
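The original implementation is missing from this copy of the answer. As a stand-in, here is one possible hand-written scanner with the behaviour described above, including the caveat that quoted and unquoted runs that touch each other are joined into one token. The names `QuoteAwareSplitter` and `split` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; a sketch, not the answer's original code.
public class QuoteAwareSplitter {

    public static List<String> split(String str) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inToken = false; // true once the current token has started
        char quote = 0;          // the open quote character, or 0 when outside quotes

        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == '\\' && i + 1 < str.length()) {
                // A backslash makes the next character literal.
                current.append(str.charAt(++i));
                inToken = true;
            } else if (quote != 0) {
                if (c == quote) {
                    quote = 0;         // closing quote
                } else {
                    current.append(c); // anything else inside quotes, spaces included
                }
            } else if (c == '"' || c == '\'') {
                quote = c;             // opening quote
                inToken = true;        // so that "" still yields an empty token
            } else if (Character.isWhitespace(c)) {
                if (inToken) {
                    tokens.add(current.toString());
                    current.setLength(0);
                    inToken = false;
                }
            } else {
                current.append(c);
                inToken = true;
            }
        }
        if (inToken) {
            // Unbalanced quotes are tolerated: whatever was collected is emitted.
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"Hello\\\" World\" \"Hello Universe\" Hi"));
    }
}
```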
I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
There will be open source parsers which can do what you want, although I don't know any. You should also check out the StreamTokenizer class.
The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).
With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.
In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.
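The code this paragraph refers to did not survive in this copy, so what follows is a reconstruction under assumptions rather than the answer's own code. It follows the description: group 1 captures the body of a double-quoted token, group 2 the body of a single-quoted one, and unquoted tokens are taken from the whole match. The class name `TokenMatcher` and the exact pattern are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; the pattern is one way to match tokens instead of delimiters.
public class TokenMatcher {

    private static final Pattern TOKEN = Pattern.compile(
            "\"((?:[^\"\\\\]|\\\\.)*)\""      // group 1: body of a double-quoted token
          + "|'((?:[^'\\\\]|\\\\.)*)'"        // group 2: body of a single-quoted token
          + "|(?:\\\\.|[^\\s\"'\\\\])+");     // unquoted token, may contain escapes

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String raw;
            if (m.group(1) != null) {
                raw = m.group(1);
            } else if (m.group(2) != null) {
                raw = m.group(2);
            } else {
                raw = m.group();
            }
            // Separate step: drop each escaping backslash, keep the escaped character.
            tokens.add(raw.replaceAll("\\\\(.)", "$1"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("\"Hello\\\" World\" \"Hello Universe\" Hi")) {
            System.out.println(token);
        }
    }
}
```

Because find() simply skips anything no alternative can match, a stray quote in malformed input is dropped silently rather than reported.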
output:
This is about as simple as it gets for a regex-based solution--and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.
To recap, you want to split on whitespace, except when surrounded by double quotes, which are not preceded by a backslash.
Step 1: tokenize the input:
/([ \t]+)|(\\")|(")|([^ \t"\\]+|\\)/
This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.
Step 2: build a finite state machine matching and reacting to the tokens:
State: START
State: WITHIN_QUOTES
Step 3: Profit!!
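The transitions for the two states are not spelled out above, so the sketch below fills them in with one plausible reading: in START, a SPACE ends the current token and a QUOTE enters WITHIN_QUOTES; in WITHIN_QUOTES, a SPACE is kept as text and a QUOTE returns to START; ESCAPED_QUOTE and TEXT are appended in both states. The class name `StateMachineSplitter` is hypothetical. The TEXT alternative used here excludes the backslash (and matches a lone backslash on its own), since otherwise TEXT would swallow the backslash of `\"` before ESCAPED_QUOTE could match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class illustrating the lexer + state machine idea.
public class StateMachineSplitter {

    private enum State { START, WITHIN_QUOTES }

    // Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    private static final Pattern LEXER =
            Pattern.compile("([ \\t]+)|(\\\\\")|(\")|([^ \\t\"\\\\]+|\\\\)");

    public static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inToken = false;
        State state = State.START;

        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {                 // SPACE
                if (state == State.WITHIN_QUOTES) {
                    current.append(m.group(1));       // whitespace is data inside quotes
                } else if (inToken) {
                    tokens.add(current.toString());   // whitespace ends the token
                    current.setLength(0);
                    inToken = false;
                }
            } else if (m.group(2) != null) {          // ESCAPED_QUOTE
                current.append('"');
                inToken = true;
            } else if (m.group(3) != null) {          // QUOTE toggles the state
                state = (state == State.START) ? State.WITHIN_QUOTES : State.START;
                inToken = true;
            } else {                                  // TEXT
                current.append(m.group(4));
                inToken = true;
            }
        }
        if (inToken) {
            tokens.add(current.toString());           // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"Hello\\\" World\" \"Hello Universe\" Hi"));
    }
}
```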
I think if you use a pattern like this:
Then it will give you the desired output. When I ran it with your input data I got this list:
I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+?
EDIT
Change your opts.add(token); line to:
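The pattern and the replacement line from this answer are not preserved in this copy. Purely as an illustration of the approach (adjust the question's pattern so the double-quoted alternative tolerates `\"`, then clean each token before adding it), here is a hypothetical version. The class `PatternFixSketch` and the helper `clean` are invented names, and a plain Matcher loop is used in place of the question's Scanner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the pattern or the line the answer actually proposed.
public class PatternFixSketch {

    private static final Pattern TOKEN = Pattern.compile(
            "\"(?:\\\\.|[^\"\\\\])*\""   // double-quoted token, \" allowed inside
          + "|'[^']*'"                   // single-quoted token
          + "|[A-Za-z]+");               // everything else

    public static List<String> split(String str) {
        List<String> opts = new ArrayList<>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            String token = m.group();
            opts.add(clean(token));      // instead of opts.add(token)
        }
        return opts;
    }

    // Strips the enclosing quotes and turns \" back into " for double-quoted tokens.
    static String clean(String token) {
        if (token.startsWith("\"")) {
            return token.substring(1, token.length() - 1).replace("\\\"", "\"");
        }
        if (token.startsWith("'")) {
            return token.substring(1, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(split("\"Hello\\\" World\" \"Hello Universe\" Hi"));
    }
}
```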