How to include ё in [а-я] regexp char interval

Posted 2020-03-08 03:44

The Russian alphabet includes the letter ё, which was undeservedly forgotten at the beginning of computing.

So, if I want to use a regexp with a character range, I must mention this letter separately:

[а-яА-ЯёЁ]

instead of:

[а-яА-Я]

Example:

Suppose we have the string "Верёвочка - 12" and need to extract only the word with a regular expression:

word = "Верёвочка"[/а-яА-Я/]   # => ""
word = "Верёвочка"[/а-яА-ЯёЁ/] # => "Верёвочка"

How can I extend the Regexp class in Ruby or Ruby on Rails to solve this problem?

3 answers
The star\"
Reply #2 · 2020-03-08 03:47

The original /а-яА-Я/ and /а-яА-ЯёЁ/ patterns just match literal character sequences, the strings а-яА-Я and а-яА-ЯёЁ respectively, because the ranges are not enclosed in [ and ], which is what forms a character class. Even if they were, without a quantifier the pattern would match only a single char that falls within the range(s).

To match a sequence of one or more Russian letters, you need either of:

/[а-яА-ЯёЁ]+/
/[а-яё]+/i

See the Rubular demo

Note that there is NO Unicode category class like \p{Russian}, and \p{Cyrillic} matches all Cyrillic chars, not just the Russian ones. The letter Ёё does not fall into the ranges а-я and А-Я and must be added "manually". In the Unicode table, А-Я and а-я occupy U+0410 to U+044F, while Ё is U+0401 and ё is U+0451.
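To check the code points without a table, a small sketch like the following can help. The helper name `in_basic_range?` is made up for illustration and is not part of Ruby.

```ruby
# Print the code points of the range endpoints and of Ё/ё.
%w[А Я а я Ё ё].each do |ch|
  puts format("%s  U+%04X", ch, ch.ord)
end

# Hypothetical helper: true when ch lies inside the contiguous block А..я
# (U+0410..U+044F), which is what [А-я] or [а-яА-Я] can cover.
def in_basic_range?(ch)
  ch.ord.between?("А".ord, "я".ord)
end

puts in_basic_range?("ж") # true:  ж is U+0436
puts in_basic_range?("ё") # false: ё is U+0451, just past я (U+044F)
```

Ё (U+0401) sits before А and ё (U+0451) sits after я, so neither range can reach them.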

And here is the Ruby demo:

s = "Верёвочка - 12"
puts s[/[а-яА-ЯёЁ]+/] # => Верёвочка
puts s[/[а-яё]+/i]    # => Верёвочка
Evening l夕情丶
Reply #3 · 2020-03-08 03:56

This is cool - I had never thought that much about character ranges in Unicode.

It seems that for some reason А-я were encoded in the Unicode range 0x410 to 0x44f, but some other characters (such as ё) were added in 0x400 to 0x40f and then 0x450 to 0x45f (Wikipedia has a full breakdown of what characters went where).

As a consequence, /[Ѐ-ё]/ should work, but it also matches the non-Russian letters in between (such as Ђ and Є), and it might feel quite illogical to a native speaker.

You can of course use raw Unicode escapes, i.e. /[\u0400-\u045f]/ (or up to \u04ff if you want the full Cyrillic block), but then you either have to remember the range or assign it to some constant for future use.
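As a rough sketch of the "assign it to a constant" idea: the constant name `RUSSIAN_WORD` is hypothetical, and the range is deliberately narrower than the one above. It covers only А-я (U+0410..U+044F) plus Ё (U+0401) and ё (U+0451), on the assumption that only Russian letters are wanted.

```ruby
# Hypothetical constant; not part of Ruby or Rails.
# А-я is U+0410..U+044F, Ё is U+0401, ё is U+0451.
RUSSIAN_WORD = /[\u0410-\u044F\u0401\u0451]+/

puts "Верёвочка - 12"[RUSSIAN_WORD] # Верёвочка
p "Ёжик и ёлка".scan(RUSSIAN_WORD)  # ["Ёжик", "и", "ёлка"]
```

In a Rails app such a constant could live in an initializer or a small module, which avoids patching Regexp itself.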

Lastly, you can refer to entire scripts with

/\p{Cyrillic}/

although my understanding is that this includes more characters, such as Ԧ
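To see the difference in practice, here is a small comparison sketch. The constant names are made up for illustration, and the sample words are my own choice.

```ruby
# Hypothetical constants for comparison.
RUSSIAN_ONLY = /\A[а-яА-ЯёЁ]+\z/
ANY_CYRILLIC = /\A\p{Cyrillic}+\z/

# "їжак" is Ukrainian (ї is U+0457);
# "Ԧ" is U+0526 from the Cyrillic Supplement block.
["ёжик", "їжак", "Ԧ"].each do |word|
  russian  = word.match?(RUSSIAN_ONLY)
  cyrillic = word.match?(ANY_CYRILLIC)
  puts "#{word}: russian=#{russian} cyrillic=#{cyrillic}"
end
```

Only "ёжик" passes both checks; the other two match \p{Cyrillic} alone.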

Fickle 薄情
Reply #4 · 2020-03-08 04:09

There is one more solution, though not a pretty one: use /[а-ё]/ instead of /[а-яё]/. It works, but the range endpoints are not in alphabetical order:

str = "верёвочка"
str[/^[а-ё]+$/]
#=> "верёвочка"
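A quick sketch of what this range does and does not cover (the constant name is hypothetical): а-ё spans U+0430 to U+0451, so it is lowercase-only and also picks up ѐ (U+0450), which is not a Russian letter.

```ruby
# Hypothetical constant; the range а-ё spans U+0430..U+0451.
LOWER_RANGE = /\A[а-ё]+\z/

puts "верёвочка".match?(LOWER_RANGE) # true
puts "Верёвочка".match?(LOWER_RANGE) # false: capital В (U+0412) is below the range
puts "\u0450".match?(LOWER_RANGE)    # true: ѐ (U+0450) falls inside the range
```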