Is there a native "PHP way" to parse command arguments from a string? For example, given the following string:
foo "bar \"baz\"" '\'quux\''
I'd like to create the following array:
array(3) {
  [0] =>
  string(3) "foo"
  [1] =>
  string(9) "bar "baz""
  [2] =>
  string(6) "'quux'"
}
I've already tried to leverage token_get_all(), but PHP's variable interpolation syntax (e.g. "foo ${bar} baz") pretty much rained on my parade.
I know full well that I could write my own parser. Command argument syntax is super simplistic, but if there's an existing native way to do it, I'd much prefer that over rolling my own.
EDIT: Please note that I am looking to parse the arguments from a string, NOT from the shell/command-line.
EDIT #2: Below is a more comprehensive example of the expected input -> output for arguments:
foo -> foo
"foo" -> foo
'foo' -> foo
"foo'foo" -> foo'foo
'foo"foo' -> foo"foo
"foo\"foo" -> foo"foo
'foo\'foo' -> foo'foo
"foo\foo" -> foo\foo
"foo\\foo" -> foo\foo
"foo foo" -> foo foo
'foo foo' -> foo foo
I wrote some packages for console interactions:
Arguments parsing
There is a package that does the whole argument parsing thing: weew/php-console-arguments
Example:
$args will be an array:
Arguments can be grouped:
$args will become:
It can do much more; just check the readme.
Output styling
You might need a package for output styling: weew/php-console-formatter
Console application
The packages above can be used standalone or in combination with a fancy console application skeleton: weew/php-console
Note: These solutions are not native but might still be useful to some people.
I've worked out the following expression to match the various enclosures and escape sequences:
It matches:
Afterwards, you need to (carefully) remove the escaped characters:
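The expression and the clean-up code from this answer are not reproduced here. As a stand-in, here is a minimal sketch of the same two-step idea (match quoted or bare tokens, then remove the escapes), checked against the examples in the question. The function name parse_arguments and the exact pattern are assumptions for illustration, not the answer's original code.

```php
<?php
// Sketch only: a hypothetical stand-in, not the answer's original expression.
function parse_arguments(string $input): array
{
    // Alternatives: "double-quoted" | 'single-quoted' | bare word.
    // Inside quotes, a backslash always consumes the character after it.
    $pattern = '/"((?:[^"\\\\]|\\\\.)*)"|\'((?:[^\'\\\\]|\\\\.)*)\'|(\S+)/s';
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $args = [];
    foreach ($matches as $m) {
        if (isset($m[3])) {
            // Bare word: taken as-is.
            $args[] = $m[3];
        } elseif (isset($m[2])) {
            // Single-quoted: unescape only \\ and \'.
            $args[] = strtr($m[2], ['\\\\' => '\\', "\\'" => "'"]);
        } else {
            // Double-quoted: unescape only \\ and \".
            $args[] = strtr($m[1], ['\\\\' => '\\', '\\"' => '"']);
        }
    }
    return $args;
}

// Usage with the string from the question (nowdoc keeps the backslashes literal).
$input = <<<'TXT'
foo "bar \"baz\"" '\'quux\''
TXT;
var_dump(parse_arguments($input));
```

The unescaping is deliberately narrow: a backslash is dropped only before the enclosing quote character or another backslash, so "foo\foo" stays foo\foo as the question requires. A bare word that contains a quote (such as foo"bar") is kept verbatim in this sketch.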
Update
For the fun of it, I've written a more formal parser, outlined below. It won't give you better performance; it's about three times slower than the regular expression, mostly due to its object-oriented nature. I suppose the advantage is more academic than practical:
It's based on StringIterator to walk through the string one character at a time:
Well, you could also build this parser with a recursive regex:
Now that's a bit long, so let's break it out:
So how does this work? Well, the identifier should be obvious...
The two quoted sub-patterns are basically the same, so let's look at the single quoted string:
Really, that's a quote character followed by a recursive sub-pattern, followed by an end quote.
The magic happens in the sub-pattern.
That part basically consumes any non-quote and non-escape character. We don't care about them, so eat them up. Then, if we encounter either a quote or a backslash, trigger an attempt to match the entire sub-pattern again.
If we can consume a backslash, then consume the next character (without caring what it is), and recurse again.
Finally, we have an empty component (if the escaped character is last, or if there's no escape character).
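The pattern itself is not reproduced here, so the following is only a sketch reconstructed from the description: a quote, a sub-pattern that eats ordinary characters, optionally consumes a backslash plus one character and then calls itself again, and a closing quote. The group names sq and dq, the bare-word alternative, and the function name tokenize_recursive are assumptions.

```php
<?php
// Sketch only: reconstructed from the prose, not the answer's original pattern.
function tokenize_recursive(string $input): array
{
    // x flag: whitespace in the pattern is ignored; s flag: "." matches newlines.
    $pattern = <<<'REGEX'
~
(?(DEFINE)
  (?<sq> [^'\\]*+ (?: \\ . (?&sq) )? )
  (?<dq> [^"\\]*+ (?: \\ . (?&dq) )? )
)
' (?&sq) ' | " (?&dq) " | [^\s'"]+
~xs
REGEX;
    preg_match_all($pattern, $input, $matches);
    // Raw tokens, still carrying their quotes and escapes.
    return $matches[0];
}

$input = <<<'TXT'
foo "bar \"baz\"" '\'quux\''
TXT;
var_dump(tokenize_recursive($input));
```

This only performs the matching step; the quotes and escapes would still need to be removed afterwards. Each escape sequence adds one level of recursion, so very escape-heavy input could run into PCRE's recursion limits.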
Running this on the test input @HamZa provided returns the same result:
The main difference is in terms of efficiency. This pattern should backtrack less (since it's a recursive pattern, there should be next to no backtracking for a well-formed string), whereas the other regex is non-recursive and will backtrack on every single character (that's what the ? after the * forces: non-greedy pattern consumption).
For short inputs this doesn't matter. On the test case provided, they run within a few percent of each other (the margin of error is greater than the difference). But with a single long string with no escape sequences:
The difference is significant (100 runs):
float(0.00030398368835449)
float(0.00055909156799316)
Of course, we can partially lose this advantage with a lot of escape sequences:
float(0.00040411949157715)
float(0.00045490264892578)
But note that the length still dominates. That's because the backtracker scales at O(n^2), where the recursive solution scales at O(n). However, since the recursive pattern always needs to recurse at least once, it's slower than the backtracking solution on short strings:
float(0.0002598762512207)
float(0.00017595291137695)
The tradeoff appears to happen around 15 characters... But both are fast enough that it won't make a difference unless you're parsing several KB or MB of data... But it's worth discussing...
On sane inputs, it won't make a significant difference. But if you're matching more than a few hundred bytes, it may start to add up significantly...
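For anyone who wants to repeat this kind of measurement, here is a rough timing harness. It is a sketch, not the benchmark that produced the numbers above: time_pattern is a hypothetical helper, and the two patterns are simple stand-ins (one lazy, one greedy) rather than the expressions from the answers.

```php
<?php
// Hypothetical helper: total wall-clock time for $runs calls of preg_match_all().
function time_pattern(string $pattern, string $subject, int $runs = 100): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        preg_match_all($pattern, $subject);
    }
    return microtime(true) - $start;
}

// Stand-in patterns (assumptions): a lazy quantifier versus a greedy one.
$lazy   = '/"(?:\\\\.|.)*?"/s';
$greedy = '/"(?:[^"\\\\]|\\\\.)*"/s';

// One long double-quoted string with no escape sequences.
$subject = '"' . str_repeat('a', 1000) . '"';

var_dump(time_pattern($lazy, $subject));
var_dump(time_pattern($greedy, $subject));
```

Timings vary by machine, PHP version, and whether the PCRE JIT is enabled, so the absolute values matter less than how they change as the subject grows.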
Edit
If you need to handle arbitrary "bare words" (unquoted strings), then you can change the original regex to:
However, it really depends on your grammar and what you consider a command or not. I'd suggest formalizing the grammar you expect...
You can simply use str_getcsv and do a little cosmetic surgery with stripslashes and trim.
Example :
Output
Caution
There is no universal format for arguments; it is best to specify a particular format, and the easiest I have seen is CSV.
Example
Using CSV, you can simply get this output:
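The original example and its output are not reproduced here. A minimal sketch of the str_getcsv approach might look like the following; the function name parse_csv_style is an assumption, and it treats the space as the delimiter and the double quote as the only enclosure.

```php
<?php
// Sketch only: space-delimited, double-quote-enclosed fields via str_getcsv().
function parse_csv_style(string $input): array
{
    // Arguments: delimiter ' ', enclosure '"', escape character '\'.
    $fields = str_getcsv($input, ' ', '"', '\\');
    // str_getcsv() leaves the escape character in the field, so strip it afterwards.
    return array_map('stripslashes', $fields);
}

var_dump(parse_csv_style('foo "bar baz" qux'));
```

Note the limits of this shortcut: str_getcsv accepts a single enclosure character, so single-quoted arguments need separate handling (for example with trim), and consecutive spaces produce empty fields.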
Based on HamZa's answer: