UTF-8 correct regex for CamelCase (WikiWord) in Perl

There was an earlier question about a CamelCase regex. Combining it with tchrist's post, I'm wondering what the correct UTF-8 CamelCase regex would be.

Starting with (brian d foy's) regex:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word boundary
/x

and modifying to:

/
    \b          # start at word boundary
    \p{Uppercase_Letter}     # start with upper
    \p{Alphabetic}*          # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
          |                  # or 
       \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
    )

    \p{Alphabetic}*          # anything that's left
    \b          # end at word boundary
/x

I have a problem with the lines marked '###'.

In addition, how should the regex be modified if numbers and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?

Update (in response to ysth's comment):

In the following, "any" means "an uppercase letter, a lowercase letter, a number, or an underscore".

The regex should match words such as CamelWord and CaW, built as follows:

start with uppercase letter
optional any
lowercase letter or number or underscore
optional any
uppercase letter
optional any
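
A literal reading of the six steps above can be sketched as follows. This is only one possible interpretation: it assumes "number" means a Unicode decimal digit (\p{Decimal_Number}) and takes uppercase and lowercase as \p{Upper} and \p{Lower}. The names $any, $lowish, $wiki_re and is_wikiword are made up for this illustration.

```perl
use strict;
use warnings;

# "any": uppercase, lowercase, decimal digit or underscore (assumed reading).
my $any    = qr/[\p{Upper}\p{Lower}\p{Decimal_Number}_]/;
# The "lowercase-like" class: lowercase, decimal digit or underscore.
my $lowish = qr/[\p{Lower}\p{Decimal_Number}_]/;

my $wiki_re = qr{
    \b
    \p{Upper}     # start with uppercase
    $any*         # optional any
    $lowish       # lowercase, number or underscore
    $any*         # optional any
    \p{Upper}     # uppercase
    $any*         # optional any
    \b
}x;

# Hypothetical helper: true if the whole string is one such word.
sub is_wikiword {
    my ($word) = @_;
    return $word =~ /\A$wiki_re\z/ ? 1 : 0;
}

for my $word (qw(CamelWord CaW W2X3 Camel CAMEL W2)) {
    printf "%-10s %s\n", $word, is_wikiword($word) ? "match" : "no match";
}
```

One limitation of this sketch: combining marks are not part of "any" here, so a word written with decomposed accents would not match.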

Please do not mark this as a duplicate, because it is not. The original question (and its answers) considered only ASCII.

I really can’t tell what you’re trying to do, but this should be closer to what your original intent seems to have been. I still can’t tell what you mean to do with it, though.

m{
    \b
    \p{Upper}      #  start with uppercase code point (NOT LETTER)

    \w*            #  optional ident chars 

    # note that upper and lower are not related to letters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )

    \w*

    \b
}x
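
To see how that pattern behaves, here is a small test harness. It is only a sketch: the helper name is_camel and the sample words are invented for illustration, and the pattern is anchored with \A and \z so that each sample is tested as a whole word. The non-ASCII sample is written with \x{...} escapes so the script does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# The pattern from above, compiled once. Because it uses \p{...},
# Perl applies Unicode rules when matching it.
my $camel_re = qr{
    \b
    \p{Upper}
    \w*
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical helper: true if the whole string is one CamelCase word.
sub is_camel {
    my ($word) = @_;
    return $word =~ /\A$camel_re\z/ ? 1 : 0;
}

my @samples = (
    "CamelCase",
    "CaW",
    "Camel",
    "camelCase",
    "W2X3",
    # Cyrillic word with two capitals, spelled with code point escapes
    "\x{412}\x{438}\x{43A}\x{438}\x{421}\x{43B}\x{43E}\x{432}\x{43E}",
);

binmode STDOUT, ':encoding(UTF-8)';
printf "%-12s %s\n", $_, is_camel($_) ? "match" : "no match" for @samples;
```

Note that W2X3 does not match this pattern, since it contains no lowercase code point; treating digits and the underscore as lowercase needs a different character class.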

Never use [a-z]. And in fact, don’t use \p{Lowercase_Letter} or \p{Ll}, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
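
The difference is easy to check on code points whose case property and general category disagree. The sketch below uses a hypothetical helper, case_props, and a few sample code points picked for illustration: the small and capital Roman numeral one (U+2170, U+2160), which are letter numbers rather than letters, and the circled small a (U+24D0), which is a symbol.

```perl
use strict;
use warnings;

# Report which case-related properties a single character has.
# case_props is a made-up name used only for this illustration.
sub case_props {
    my ($ch) = @_;
    my %props = (
        Lower => ($ch =~ /\A\p{Lower}\z/ ? 1 : 0),
        Ll    => ($ch =~ /\A\p{Ll}\z/    ? 1 : 0),
        Upper => ($ch =~ /\A\p{Upper}\z/ ? 1 : 0),
        Lu    => ($ch =~ /\A\p{Lu}\z/    ? 1 : 0),
    );
    return \%props;
}

for my $ch ("a", "A", "\x{2170}", "\x{24D0}", "\x{2160}") {
    my $p = case_props($ch);
    printf "U+%04X  Lower=%d Ll=%d  Upper=%d Lu=%d\n",
        ord($ch), @{$p}{qw(Lower Ll Upper Lu)};
}
```

For plain "a" and "A" the two kinds of property agree; for the other three, \p{Lower} or \p{Upper} matches while \p{Ll} and \p{Lu} do not.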

And remember that \w is really just an alias for

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
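
As a quick spot check (not a proof of equivalence), the sketch below compares \w with that bracketed class on a handful of sample characters. The helper name word_checks and the choice of samples are assumptions made for this illustration.

```perl
use strict;
use warnings;

# The expanded class quoted above.
my $word_class = qr/[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]/;

# Hypothetical helper: returns (matches \w, matches the expanded class).
sub word_checks {
    my ($ch) = @_;
    my $by_w     = $ch =~ /\A\w\z/          ? 1 : 0;
    my $by_class = $ch =~ /\A$word_class\z/ ? 1 : 0;
    return ($by_w, $by_class);
}

my @samples = (
    "a",         # ASCII letter
    "\x{0416}",  # CYRILLIC CAPITAL LETTER ZHE
    "7",         # ASCII digit
    "\x{0663}",  # ARABIC-INDIC DIGIT THREE
    "_",         # underscore (connector punctuation)
    "\x{203F}",  # UNDERTIE (connector punctuation)
    "\x{0301}",  # COMBINING ACUTE ACCENT (mark)
    "\x{2160}",  # ROMAN NUMERAL ONE (letter number)
    "-",         # hyphen-minus, not a word character
    "\x{2014}",  # EM DASH, not a word character
);

for my $ch (@samples) {
    my ($by_w, $by_class) = word_checks($ch);
    printf "U+%04X  \\w=%d  class=%d\n", ord($ch), $by_w, $by_class;
}
```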
