What is the proper regular expression to match all

Posted 2020-05-31 04:29

Question:

I would like to match all lowercase letter forms in the Latin block. The trivial '[a-z]' only matches characters between U+0061 and U+007A, and not all the other lowercase forms.

I would like to match all lowercase letters, most importantly, all the accented lowercase letters in the Latin block used in EFIGS languages.

[a-zà-ý] is a start, but there are still tons of other lowercase characters (see http://www.unicode.org/charts/PDF/U0000.pdf). Is there a recommended way of doing this?

FYI I'm using Python, but I suspect that this problem is cross-language.

Python's builtin "islower()" method seems to do the right checking:

# Collect every code point in the Basic Multilingual Plane that islower() accepts.
lower = u''
for c in xrange(0, 2**16):
  if unichr(c).islower():
    lower += unichr(c)

print lower

Answer 1:

Python's standard re module does not currently support Unicode property escapes such as \p{Ll} in regular expressions. See this answer for a link to the Ponyguruma library, which does support them.

Using such a library, you could use \p{Ll} to match any lowercase letter in a Unicode string.
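As a rough illustration of what "using such a library" can look like, here is a sketch in Python 3 that assumes the third-party `regex` package (installed separately, for example with `pip install regex`) rather than Ponyguruma. The helper name `find_lowercase_runs` is made up for this example, and the import is guarded so the snippet still loads when the package is missing.

```python
# Assumption: the third-party "regex" package is available; it is not part
# of the standard library and is not the Ponyguruma library named above.
try:
    import regex
except ImportError:
    regex = None


def find_lowercase_runs(text):
    """Return all runs of lowercase letters (general category Ll) in text."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    # \p{Ll} is the Unicode property escape for "Letter, lowercase".
    return regex.findall(r"\p{Ll}+", text)


if regex is not None:
    print(find_lowercase_runs("\u00e0\u00c0helloHello"))
```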

Every character in the Unicode standard is in exactly one category. \p{Ll} is the category of lowercase letters, while \p{L} comprises all the characters in one of the "Letter" categories (Letter, uppercase; Letter, lowercase; Letter, titlecase; Letter, modifier; and Letter, other). For more information see the Character Properties chapter of the Unicode Standard. Or see this page for a good explanation on use of Unicode in regular expressions.
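The category assignment described above can be inspected from Python itself with the standard `unicodedata` module. The following Python 3 sketch prints the general category for one sample character from each of the five "Letter" categories; the helper `is_lowercase_letter` is a name introduced here for illustration only.

```python
import unicodedata


def is_lowercase_letter(ch):
    """True if ch is in the general category Ll (Letter, lowercase)."""
    return unicodedata.category(ch) == "Ll"


# One sample from each "Letter" category: Ll, Lu, Lt, Lm, Lo.
samples = ["a", "\u00c0", "\u01c5", "\u02b0", "\u4f60"]
for ch in samples:
    print(repr(ch), unicodedata.category(ch), is_lowercase_letter(ch))
```

Each character reports exactly one two-letter category code, which is what a `\p{...}` escape tests in engines that support it.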



Answer 2:

It looks as though this recipe, posted back in 2005,

import sys, re

# Build a character class containing every character that isupper() accepts.
uppers = [u'[']
for i in xrange(sys.maxunicode + 1):
  c = unichr(i)
  if c.isupper(): uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)

print uppers_re.match(u'A')

is still relevant.
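For readers on Python 3, the same idea might be sketched as follows. This is a variation on the recipe, not the recipe itself: it selects characters by `unicodedata.category` instead of `isupper()`/`islower()`, targets lowercase letters (Ll) as the question asks, and collapses consecutive code points into ranges so the pattern stays short. The function names are invented for this example.

```python
import re
import sys
import unicodedata


def _as_range(first, last):
    """Format one run of code points as 'a' or 'a-z' for a character class."""
    lo, hi = re.escape(chr(first)), re.escape(chr(last))
    return lo if first == last else lo + "-" + hi


def build_category_class(categories):
    """Return a character class covering every code point in categories."""
    parts = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp          # open a new run
            prev = cp
        elif start is not None:
            parts.append(_as_range(start, prev))  # close the current run
            start = None
    if start is not None:
        parts.append(_as_range(start, prev))
    return "[" + "".join(parts) + "]"


LOWER_RE = re.compile(build_category_class({"Ll"}) + "+")
print(LOWER_RE.findall("\u00e0\u00c0helloHello"))
```

Building the class scans the whole code space once, so it is worth compiling the pattern a single time and reusing it.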



Answer 3:

You might want to have a look at regular-expressions.info.

However, as far as I know there's no character class or modifier that expresses "lowercase characters only" (and not every language has lowercase characters), so I'd say you might have to use multiple ranges (possibly almost as many as there are Unicode blocks).

Edit: reading a bit more on this, there might be a way: [\p{Ll}\p{Lo}], which matches lowercase letters (Ll) as well as letters that have no upper/lower case distinction (Lo), such as Chinese characters.

Regex [\p{Ll}\p{Lo}]+ matches test string àÀhelloHello你好Прывітанне and replacing the matches with x results in xÀxHxПx whereas replacing the matches of [\p{Ll}]+ results in xÀxHx你好Пx (note the Chinese characters that were not matched).
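The replacement experiment above can be reproduced in Python 3 without a property-aware regex engine. The sketch below emulates `[\p{Ll}\p{Lo}]+` and `[\p{Ll}]+` by grouping consecutive characters according to their `unicodedata` category; `replace_runs` is a hypothetical helper written for this example, and the test string is normalized to NFC so that accented letters are single code points.

```python
import unicodedata
from itertools import groupby


def replace_runs(text, categories, replacement="x"):
    """Replace each maximal run of characters whose general category is in
    categories with replacement, leaving all other characters untouched."""
    out = []
    in_class = lambda ch: unicodedata.category(ch) in categories
    for matched, run in groupby(text, key=in_class):
        out.append(replacement if matched else "".join(run))
    return "".join(out)


TEST = unicodedata.normalize("NFC", "àÀhelloHello你好Прывітанне")
print(replace_runs(TEST, {"Ll", "Lo"}))  # xÀxHxПx
print(replace_runs(TEST, {"Ll"}))        # xÀxHx你好Пx
```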



Answer 4:

If you use \p{L}, it will match any Unicode letter. Check the examples here. You can also combine it with \p{M} to match text in languages such as Hebrew that include diacritic marks: (\p{L}|\p{M})+

EDIT:

I missed the part about only lowercase letters the first time around. \p{L} will match all letters, \p{Ll} will match lowercase only.
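To see why \p{M} matters alongside \p{L}, the following Python 3 sketch checks category prefixes with the standard `unicodedata` module, as a stand-in for `(\p{L}|\p{M})+` in an engine that supports property escapes. The helper names are assumptions made for this example.

```python
import unicodedata


def is_letter_or_mark(ch):
    """True if ch is in any Letter (L*) or Mark (M*) general category."""
    return unicodedata.category(ch)[0] in ("L", "M")


def is_word(text):
    """Rough equivalent of a full match against (\\p{L}|\\p{M})+."""
    return bool(text) and all(is_letter_or_mark(ch) for ch in text)


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (category Mn)
hebrew = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"  # Hebrew letters with vowel points

print(is_word(decomposed))  # True
print(is_word(hebrew))      # True
print(is_word("abc1"))      # False: '1' is a digit (Nd)
```

A letters-only test would reject both of the first two strings, because the combining marks are separate code points that are not in any Letter category.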