With this regex:
regex1 = /\z/
the following strings match:
"hello" =~ regex1 # => 5
"こんにちは" =~ regex1 # => 5
but with these regexes:
regex2 = /#$/?\z/
regex3 = /\n?\z/
the results differ:
"hello" =~ regex2 # => 5
"hello" =~ regex3 # => 5
"こんにちは" =~ regex2 # => nil
"こんにちは" =~ regex3 # => nil
What is interfering? The string encoding is UTF-8, and the OS is Linux (i.e., $/ is "\n"). Are the multibyte characters interfering with $/? How?
The problem you reported is definitely a bug in the Regexp of Ruby 2.0.0 (RUBY_VERSION #=> "2.0.0"), but it already existed in the previous 1.9 releases. It shows up when the encoding allows multi-byte characters, such as __ENCODING__ #=> #<Encoding:UTF-8>.
It does not depend on Linux; it is possible to reproduce the same behavior on OS X and Windows too.
Until bug 8210 is fixed, we can help by isolating and understanding the cases in which the problem occurs. This can also be useful for finding a workaround that applies to specific cases.
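As one illustration of a case-specific workaround, the sketch below avoids the regex engine entirely. It assumes the goal is only to detect or strip an optional trailing line terminator, which is an assumption on my part and not something stated in the question; the variable names are hypothetical.

```ruby
# encoding: utf-8
# Sketch: handle an optional trailing newline with String methods
# instead of a regex such as /\n?\z/.
str = "こんにちは\n"

# String#chomp removes one trailing line terminator ("\n", "\r\n" or "\r"), if present.
stripped = str.chomp

# String#end_with? checks for the trailing newline without using Regexp.
has_newline = str.end_with?("\n")

puts stripped
puts has_newline
```

Whether this is usable depends on what the original pattern was meant to do; it does not help when the optional element is something other than a line terminator.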
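I understand that the problem occurs when the pattern tests for zero or one occurrence of something (the ? quantifier) right before the end of the string (\z), and the last character of the string is multi-byte.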
The bug may be caused by confusion between the number of bytes and the number of characters that the regular expression engine actually checks.
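To make the distinction concrete, this small sketch prints the character count and the UTF-8 byte count of the characters used in the tests. The variable names are mine.

```ruby
# encoding: utf-8
# Character count versus byte count for the characters used in the tests.
samples = ["は", "ん", "ç", "é", "\n"]

# Each row is [string, number of characters, number of bytes in UTF-8].
sizes = samples.map { |s| [s, s.length, s.bytesize] }

sizes.each do |s, chars, bytes|
  puts "#{s.inspect}: #{chars} char, #{bytes} byte(s)"
end
```

Every sample is a single character, but the byte counts differ: 3 for は and ん, 2 for ç and é, and 1 for \n.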
A few examples may help:
TEST 1: where the last character, "は", is 3 bytes:
testing for zero or one of ん [3 bytes] before end of string:
testing for zero or one of ç [2 bytes]:
testing for zero or one of \n [1 byte]:
From the results of TEST 1 we can assert: if the last multi-byte character of the string is 3 bytes, then the 'zero or one before' test only works when we test for at least 3 bytes (not 3 characters) before it.
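The TEST 1 cases can be re-run with a sketch like the following. The helper name probe and the test string "こんにちは" (taken from the question, and ending in は) are my own choices, not necessarily what the gist uses. No expected results are shown, because they depend on whether the Ruby in use still has the bug.

```ruby
# encoding: utf-8
# Hypothetical helper (not from the original gist): match zero or one
# occurrence of the literal `optional` right before the end of `str`.
def probe(str, optional)
  regex = Regexp.new("#{Regexp.escape(optional)}?\\z")
  { optional: optional, bytes: optional.bytesize, result: str =~ regex }
end

# The last character "は" is 3 bytes; try 3-, 2- and 1-byte optional characters.
test1 = ["ん", "ç", "\n"].map { |opt| probe("こんにちは", opt) }

# :result is the match position, or nil when the regex fails to match.
test1.each { |row| p row }
```

Swapping in a string whose last character is 2 bytes, such as one ending in ç, gives the TEST 2 variant.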
TEST 2: where the last character, "ç", is 2 bytes:
testing for zero or one of ん [3 bytes]:
testing for zero or one of é [2 bytes]:
testing for zero or one of \n [1 byte]:
From the results of TEST 2 we can assert: if the last multi-byte character of the string is 2 bytes, then the 'zero or one before' test only works when we test for at least 2 bytes (not 2 characters) before it.
When the multi-byte character is not at the end of the string, I found that it works correctly.
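A quick way to compare the positions is to run the same pattern against strings that differ only in where the multi-byte characters sit. This is a sketch; the labels and the string with a trailing "!" are examples I made up.

```ruby
# encoding: utf-8
# Same pattern as regex3 in the question.
regex = /\n?\z/

cases = {
  "multi-byte last"     => "こんにちは",
  "multi-byte not last" => "こんにちは!",
  "ASCII only"          => "hello"
}

# Each entry is [label, match position or nil].
results = cases.map { |label, str| [label, str =~ regex] }

results.each { |label, pos| puts "#{label}: #{pos.inspect}" }
```

On an affected Ruby, only the first case would be expected to return nil, according to the observations above.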
public gist with my test code available here
In Ruby trunk, the issue has now been accepted as a bug. Hopefully, it will be fixed.
Update: Two patches have been posted in Ruby trunk.