How to make a regexp non-greedy with quotes?

Posted 2020-07-10 05:24

How do I make this non-greedy? preg_match_all('/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui', $outStr, $matches);

3 answers
Deceive 欺骗
#2 · 2020-07-10 05:45

See: http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

U (PCRE_UNGREEDY)

This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).
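As a rough sketch of what that manual entry describes (the sample string and variable names below are made up for illustration, and the pattern is deliberately simpler than the one in the question): the uppercase U flag, which is not the same thing as the lowercase u (UTF-8) flag used in the question, makes every quantifier lazy, and a trailing ? then flips a quantifier back to greedy.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$subject = '"foo" and "bar"';

// U flag: every quantifier in the pattern becomes lazy by default.
preg_match_all('/".+"/U', $subject, $withFlag);

// (?U) inside the pattern has the same effect as the trailing flag.
preg_match_all('/(?U)".+"/', $subject, $withInline);

// Under U, a trailing ? flips the quantifier back to greedy.
preg_match_all('/".+?"/U', $subject, $flippedBack);

print_r($withFlag[0]);    // two matches: "foo" and "bar" separately
print_r($withInline[0]);  // same two matches
print_r($flippedBack[0]); // one match spanning the whole string
```

Because U changes the meaning of every quantifier in the pattern, a ? on the single quantifier that matters is usually the less surprising choice.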

神经病院院长
#3 · 2020-07-10 05:49

Do you mean non-greedy, as in find the shortest match instead of the longest? The *, +, and ? quantifiers are greedy by default and will match as much as possible. Add a question mark after them to make them non-greedy.

preg_match_all('/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+?"/ui', $outStr, $matches);

Greedy match:

"foo" and "bar"
^^^^^^^^^^^^^^^

Non-greedy match:

"foo" and "bar"
^^^^^
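To check the difference with the actual patterns, something like the following should do. It is only a sketch: the sample value of $outStr is assumed, and the file needs to be saved as UTF-8 because of the literal Cyrillic characters.

```php
<?php
// Assumed sample input; the real $outStr from the question is not shown.
$outStr = '"foo" and "bar"';

// Pattern from the question (greedy +) and the same pattern with a lazy +?.
$greedyPattern = '/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui';
$lazyPattern   = '/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+?"/ui';

preg_match_all($greedyPattern, $outStr, $greedyMatches);
preg_match_all($lazyPattern, $outStr, $lazyMatches);

print_r($greedyMatches[0]); // a single match covering both quoted strings
print_r($lazyMatches[0]);   // "foo" and "bar" as separate matches
```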
做个烂人
#4 · 2020-07-10 05:49

You suggested

/"[\p{L}\p{Nd}а-яА-ЯёЁ -_\.\+]+"/ui

which I submit is equivalent to:

/"[\pL\p{Nd}а-яА-ЯёЁ -_.+]+"/ui

To show which non-ASCII characters you’re using, in case it is not obvious, here it is with \x{⋯} escapes:

/"[\pL\p{Nd}\x{430}-\x{44F}\x{410}-\x{42F}\x{451}\x{401} -_.+]+"/ui

And using named characters is:

/"[\pL\p{Nd}\N{CYRILLIC SMALL LETTER A}-\N{CYRILLIC SMALL LETTER YA}\N{CYRILLIC CAPITAL LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}\N{CYRILLIC SMALL LETTER IO}\N{CYRILLIC CAPITAL LETTER IO} -_.+]+"/ui

BTW, those are produced by running them through the uniquote script, the first using uniquote -x and the second using uniquote -v.

And yes, I know or at least believe that PHP doesn’t support named characters yet, but it makes it easier to talk about. Also, it makes sure they don't confuse the lookalikes:

U+0410 ‹А› \N{CYRILLIC CAPITAL LETTER A}
U+0430 ‹а› \N{CYRILLIC SMALL LETTER A}
U+0401 ‹Ё› \N{CYRILLIC CAPITAL LETTER IO}
U+0451 ‹ё› \N{CYRILLIC SMALL LETTER IO}

for:

U+0041 ‹A› \N{LATIN CAPITAL LETTER A}
U+0061 ‹a› \N{LATIN SMALL LETTER A}
U+00CB ‹Ë› \N{LATIN CAPITAL LETTER E WITH DIAERESIS}
U+00EB ‹ë› \N{LATIN SMALL LETTER E WITH DIAERESIS}
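A small sketch of why spelling out the code points helps, assuming PHP 7 or later for the "\u{...}" string escape (the variable names are just for illustration). The Cyrillic and Latin letters look the same on screen but are different characters, and ё sits outside the а-я range, which is why the pattern has to list it separately.

```php
<?php
// Code point escapes (PHP 7+) make it explicit which letter is meant.
$cyrillicA  = "\u{0430}"; // CYRILLIC SMALL LETTER A
$latinA     = "\u{0061}"; // LATIN SMALL LETTER A
$cyrillicIo = "\u{0451}"; // CYRILLIC SMALL LETTER IO

// The lookalikes are not the same string.
$sameBytes = ($cyrillicA === $latinA);

// The \x{430}-\x{44F} range is the same as writing а-я literally.
$range = '/^[\x{430}-\x{44F}]$/u';
$cyrInRange = preg_match($range, $cyrillicA);
$latInRange = preg_match($range, $latinA);
$ioInRange  = preg_match($range, $cyrillicIo);

var_dump($sameBytes);  // bool(false)
var_dump($cyrInRange); // int(1)
var_dump($latInRange); // int(0)
var_dump($ioInRange);  // int(0)
```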

And now that I think about it, those are all letters, so I cannot see why you are enumerating the Cyrillic list. Is it because you don’t want all Cyrillic letters, but rather just that particular set of them? Otherwise I would just do:

/"[\pL\p{Nd} -_.+]+"/ui

At which point I wonder about that /i. I can’t see what its purpose is, so would just write:

/"[\pL\p{Nd} -_.+]+"/u

As has been mentioned, swapping the maximally quantifying + for its corresponding minimal version, +?, will work:

/"[\pL\p{Nd} -_.+]+?"/u

However, I am concerned about that range of [ -_], that is, \N{SPACE}-\N{LOW LINE}. I find that a very peculiar range. It means any of these:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

For one thing, you’ve included the capital ASCII letters again. For another, you’ve omitted some symbols and punctuation characters:

% unichars -g '\p{ASCII}' '[\pS\pP]' 'ord() < ord(" ") || ord() > ord("_")'
 `  U+0060 GC=Sk GRAVE ACCENT
 {  U+007B GC=Ps LEFT CURLY BRACKET
 |  U+007C GC=Sm VERTICAL LINE
 }  U+007D GC=Pe RIGHT CURLY BRACKET
 ~  U+007E GC=Sm TILDE

(That output is from the unichars script, in case you’re curious.)
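The range is easy to verify from PHP itself. This is just a sketch with made-up variable names; it builds every character from U+0020 to U+005F and tests it against the [ -_] class on its own.

```php
<?php
// Build every character from U+0020 (space) to U+005F (low line).
$rangeChars = implode('', array_map('chr', range(0x20, 0x5F)));

// All of them are accepted by the [ -_] class...
$allMatch = preg_match('/^[ -_]+$/', $rangeChars);

// ...including the double quote itself, which is why a greedy +
// can run straight past a closing quote.
$quoteInRange = preg_match('/^[ -_]$/', '"');

// The five ASCII symbols above the range are not in the class.
$aboveRange = preg_match('/[ -_]/', '`{|}~');

echo $rangeChars, PHP_EOL;
var_dump($allMatch);     // int(1)
var_dump($quoteInRange); // int(1)
var_dump($aboveRange);   // int(0)
```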

Which seems strangely arbitrary. So I’m wondering whether this might not be good enough for you:

/"[\pL\p{Nd}\s\pS\pP]+?"/u

Now that I think about it, these two might cause other problems:

U+0401 ‹Ё› \N{CYRILLIC CAPITAL LETTER IO}
U+0451 ‹ё› \N{CYRILLIC SMALL LETTER IO}

That assumes those are in NFC form (formed by canonical composition of a canonical decomposition). If there were a chance that you are dealing with data that hasn’t been normalized to NFC form, then you would have to account for

NFD("\N{CYRILLIC CAPITAL LETTER IO}") => "\N{CYRILLIC CAPITAL LETTER IE}\N{COMBINING DIAERESIS}"
NFD("\N{CYRILLIC SMALL LETTER IO}")   => "\N{CYRILLIC SMALL LETTER IE}\N{COMBINING DIAERESIS}"

And now you have non-letters! The combining diaeresis is a Mark, not a Letter:

% uniprops "COMBINING DIAERESIS"
U+0308 ‹◌̈› \N{COMBINING DIAERESIS}
    \w \pM \p{Mn}
    All Any Assigned InCombiningDiacriticalMarks Case_Ignorable CI Combining_Diacritical_Marks Dia Diacritic M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC Inherited Zinh Mark Nonspacing_Mark Print Qaai Word XID_Continue XIDC

So maybe you would actually want:

/"[\pL\pM\p{Nd}\s\pS\pP]+?"/u
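Here is a sketch of the normalization issue, with the decomposed form written out by hand so that it does not depend on any extension (variable names are illustrative). If the intl extension is available, Normalizer::normalize() could be used to bring the input to NFC first, but that is a separate design choice from widening the class.

```php
<?php
$composed   = "\u{0451}";         // ё as a single code point (NFC)
$decomposed = "\u{0435}\u{0308}"; // е followed by COMBINING DIAERESIS (NFD)

$lettersOnly     = '/^\pL+$/u';
$lettersAndMarks = '/^[\pL\pM]+$/u';

// The composed form is a letter; the decomposed form contains a mark,
// so a letters-only class rejects it.
$nfcLetters = preg_match($lettersOnly, $composed);
$nfdLetters = preg_match($lettersOnly, $decomposed);

// Adding \pM to the class accepts the decomposed form as well.
$nfdWithMarks = preg_match($lettersAndMarks, $decomposed);

var_dump($nfcLetters);   // int(1)
var_dump($nfdLetters);   // int(0)
var_dump($nfdWithMarks); // int(1)
```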

If you wanted to restrict your string to containing only characters that are from the Latin or Cyrillic scripts (and not, say, Greek or Katakana), then you would add a lookahead to that effect:

/"(?:(?=[\p{Latin}\p{Cyrillic}])[\pL\pM\p{Nd}\s\pS\pP])+?"/u

Except that you also need Common to get the digits and various punctuation and symbols, and you need Inherited for combining marks following your letters. That brings us up to this:

/"(?:(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+?"/u
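A quick way to see the script restriction at work, as a sketch with assumed sample strings (the file must be saved as UTF-8): quoted Latin and Cyrillic text matches, while a quoted Greek string does not.

```php
<?php
// The script-restricted pattern from above, unchanged.
$pattern = '/"(?:(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+?"/u';

// Assumed sample inputs, one per script.
$latin    = preg_match($pattern, '"hello 42"');
$cyrillic = preg_match($pattern, '"привет"');
$greek    = preg_match($pattern, '"αβγ"');

var_dump($latin);    // int(1)
var_dump($cyrillic); // int(1)
var_dump($greek);    // int(0): the lookahead rejects Greek letters
```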

That now suggests another way to effect a minimal match between the double quotes:

/"(?:(?!")(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+"/u
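For what it is worth, here is a sketch of that version run against a sample string (the subject is assumed, not from the question). The quantifier is greedy again, but the (?!") lookahead stops each repetition at the next double quote, so the result is the same as with the lazy quantifier.

```php
<?php
// Greedy +, with a negative lookahead that refuses to consume a double quote.
$pattern = '/"(?:(?!")(?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])[\pL\pM\p{Nd}\s\pS\pP])+"/u';

// Assumed sample input.
$subject = '"foo" and "bar"';

preg_match_all($pattern, $subject, $matches);

print_r($matches[0]); // "foo" and "bar" as separate matches
```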

Which is getting too complicated not to write in /x mode:

/
    "               # literal double quote
    (?:
  ### This group specifies a single char with
  ### three separate constraints:

        # Constraint 1: next char must NOT be a double quote
        (?!")

        # Constraint 2: next char must be from one of these four scripts
        (?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])

        # Constraint 3: match one of either Letter, Mark, Decimal Number,
        #               whitespace, Symbol, or Punctuation:
        [\pL\pM\p{Nd}\s\pS\pP]

    )       # end constraint group
    +       # repeat entire group 1 or more times
    "       # and finally match another double-quote
/ux

If it were Perl, I would write that with m{⋯}xu:

m{
    "               # literal double quote
    (?:
  ### This group specifies a single char with
  ### three separate constraints:

        # Constraint 1: next char must NOT be a double quote
        (?!")

        # Constraint 2: next char must be from one of these four scripts
        (?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])

        # Constraint 3: match one of either Letter, Mark, Decimal Number,
        #               whitespace, Symbol, or Punctuation:
        [\pL\pM\p{Nd}\s\pS\pP]

    )       # end constraint group
    +       # repeat entire group 1 or more times
    "       # and finally match another double-quote
}ux

But I do not know whether you can do paired, bracketing delimiters like that in PHP.
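For the record, PHP's preg functions do accept bracket-style delimiters such as {⋯}, and nested braces inside the pattern are fine as long as they are balanced (that includes any braces in /x comments). A sketch using a nowdoc so that nothing needs extra escaping; the sample subject and variable names are assumptions, and the file must be saved as UTF-8:

```php
<?php
// Same pattern as above, with { } as delimiters and the x and u flags.
$pattern = <<<'REGEX'
{
    "                   # opening double quote
    (?:
        (?!")           # next char must not be a double quote
        (?=[\p{Latin}\p{Cyrillic}\p{Common}\p{Inherited}])
        [\pL\pM\p{Nd}\s\pS\pP]
    )+                  # repeat the group 1 or more times
    "                   # closing double quote
}ux
REGEX;

// Assumed sample input mixing Cyrillic and Latin quoted strings.
$subject = 'say "привет" then "hi"';

preg_match_all($pattern, $subject, $matches);

print_r($matches[0]); // "привет" and "hi" as separate matches
```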

Hope this helps!
