How to make this regex non-greedy: preg_match_all('/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui', $outStr, $matches);
See: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
Do you mean non-greedy, as in find the shortest match instead of the longest? The *, +, and ? quantifiers are greedy by default and will match as much as possible. Add a question mark after them to make them non-greedy.
Greedy match:
Non-greedy match:
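As a minimal PHP sketch of the difference (the sample string and both patterns are made up for illustration, they are not the patterns from the question):

```php
<?php
// Sample input invented for illustration: two quoted words on one line.
$subject = 'say "foo" and "bar" now';

// Greedy: .+ takes as much as it can, so the match runs from the first
// double quote to the last one.
preg_match('/".+"/u', $subject, $greedy);

// Non-greedy: .+? stops at the earliest closing double quote.
preg_match_all('/".+?"/u', $subject, $lazy);

echo $greedy[0], PHP_EOL;               // "foo" and "bar"
echo implode(', ', $lazy[0]), PHP_EOL;  // "foo", "bar"
```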
You suggested
which I submit is equivalent to:
To show people which non-ASCII characters you’re using, in case it is not obvious, using \x{⋯} escapes that is:
And using named characters is:
BTW, those are produced by running them through the uniquote script, the first using uniquote -x and the second using uniquote -v.
And yes, I know or at least believe that PHP doesn’t support named characters yet, but it makes it easier to talk about. Also, it makes sure they don't confuse the lookalikes:
for:
And now I think about it, those are all letters, so I cannot see why you are enumerating the Cyrillic list. Is it because you don’t want all Cyrillic letters, but rather just that particular set of them? Otherwise I would just do:
At which point I wonder about that /i. I can’t see what its purpose is, so would just write:
As has been mentioned, swapping the maximally quantifying + for its corresponding minimal version, +?, will work:
However, I am concerned about that range of [ -_], that is, \p{SPACE}-\p{LOW LINE}. I find that a very peculiar range. It means any of these:
For one thing, you’ve included the capital ASCII letters again. For another, you’ve omitted some symbols and punctuation characters:
(That output is from the unichars script, in case you’re curious.)
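For anyone without the unichars script at hand, the contents of [ -_] can be checked with a few lines of PHP. This is only a rough sketch that walks the ASCII range; the variable names are my own:

```php
<?php
// Collect every ASCII character that the class [ -_] accepts.
// The range runs from U+0020 SPACE to U+005F LOW LINE.
$inRange = '';
for ($cp = 0; $cp < 128; $cp++) {
    $ch = chr($cp);
    if (preg_match('/\A[ -_]\z/', $ch) === 1) {
        $inRange .= $ch;
    }
}

echo $inRange, PHP_EOL;          // the accepted characters, in code point order
echo strlen($inRange), PHP_EOL;  // 64
```

Note that the double quote (U+0022) falls inside this range, as do the digits and the capital letters, while the lowercase letters and characters such as ~ do not.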
Which seems strangely arbitrary. So I’m wondering whether this might not be good enough for you:
Now that I think about it, these two might cause other problems:
That assumes those are in NFC form (formed by canonical composition of a canonical decomposition). If there were a chance that you are dealing with data that hasn’t been normalized to NFC form, then you would have to account for
And now you have non-letters! The
So maybe you would actually want:
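The normalization point can be checked with a small PHP sketch. It assumes PHP 7 or later for the "\u{…}" string escapes, and the two patterns are simplified stand-ins chosen for illustration, not the full pattern under discussion:

```php
<?php
// U+0451 is the precomposed (NFC) letter; the second string is its
// canonical decomposition: U+0435 followed by U+0308 COMBINING DIAERESIS.
$composed   = "\u{0451}";
$decomposed = "\u{0435}\u{0308}";

// \p{L} alone covers the precomposed letter but not the combining mark.
$lettersOnly = '/\A\p{L}+\z/u';
// Adding \p{M} lets combining marks through as well.
$lettersAndMarks = '/\A[\p{L}\p{M}]+\z/u';

$composedOk       = preg_match($lettersOnly, $composed);
$decomposedOk     = preg_match($lettersOnly, $decomposed);
$decomposedMarkOk = preg_match($lettersAndMarks, $decomposed);

var_dump($composedOk);        // int(1)
var_dump($decomposedOk);      // int(0)
var_dump($decomposedMarkOk);  // int(1)
```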
If you wanted to restrict your string to containing only characters that are from the Latin or Cyrillic scripts (and not, say, Greek or Katakana), then you would add a lookahead to that effect:
Except that you also need Common to get the digits and various punctuation and symbols, and you need Inherited for combining marks following your letters. That brings us up to this:
That now suggests another way to effect a minimal match between the double quotes:
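A hedged PHP sketch of the script restriction follows. The helper name firstQuoted and the exact pattern are assumptions made for illustration; in particular, using [^"]+ to keep the match from crossing a quote is just one possible way to get a minimal match:

```php
<?php
// Hypothetical helper for illustration: returns the first double-quoted run
// whose characters all belong to the allowed scripts, or null if none.
function firstQuoted(string $subject): ?string
{
    // The lookahead requires that only Latin, Cyrillic, Common, or Inherited
    // characters appear between the opening quote and some closing quote.
    // [^"]+ then consumes the text up to the nearest closing quote.
    $pattern = '/"(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}]+")[^"]+"/u';
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

var_dump(firstQuoted('say "Привет world 42" now'));   // Latin + Cyrillic + Common
var_dump(firstQuoted('"αβγ"'));                        // Greek: NULL
var_dump(firstQuoted("\"\u{0435}\u{0308}\""));         // letter + Inherited mark
```

The space and the digits in the first sample are only accepted because Common is in the list, and the combining diaeresis in the last sample only because Inherited is.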
Which is getting way too complicated not to run in /x mode:
If it were Perl, I would write that with m{⋯}xu
But I do not know whether you can do paired, bracketing delimiters like that in PHP.
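For what it is worth, PHP's preg functions do accept bracket-style delimiter pairs such as { and }, so a commented /x layout is possible there too. The following is only a sketch: the pattern body and sample string are assumptions for illustration, and the comments inside the pattern avoid braces and apostrophes so that they cannot disturb the delimiters or the PHP string:

```php
<?php
// Bracket-style delimiters with the x (extended) and u (UTF-8) modifiers.
$pattern = '{
    "                      # opening double quote
    (?= [\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}]+ " )
                           # only the allowed scripts up to a closing quote
    [^"]+                  # the quoted text, never crossing a quote
    "                      # closing double quote
}xu';

// Sample input invented for illustration.
preg_match_all($pattern, 'a "один" b "two 3" c', $m);

print_r($m[0]);  // the two quoted runs: "один" and "two 3"
```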
Hope this helps!