In the following code, k2 is minimally different from k1. That is, k2 is exactly the same except that it's defined using an interpolation. (That is, I expected it to be exactly the same; obviously, from the result of p k2, it is not.)
v = /[aeiouAEIOUäöüÄÖÜ]/ # vowels
k1 = /[[ßb-zB-Z]&&[^[aeiouAEIOUäöüÄÖÜ]]]/ # consonants defined without interpolation
k2 = /[[ßb-zB-Z]&&[^#{v}]]/ # consonants defined same way, but with interpolation
But as shown below, using gsub with k1 works, whereas using it with k2 fails in a way I don't understand.
all_chars = "äöüÄÖÜß"<<('a'..'z').to_a.join<<('A'..'Z').to_a.join
p all_chars # "äöüÄÖÜßabcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
p all_chars.gsub( k1 , '_' ) # "äöüÄÖÜ_a___e___i_____o_____u_____A___E___I_____O_____U_____"
p all_chars.gsub( k2 , '_' ) # "äöüÄÖÜ_abcdefghijklm_o_____u__x__ABCDEFGHIJKLMNOPQRSTUVWXYZ"
p k1 # /[[ßb-zB-Z]&&[^[aeiouAEIOUäöüÄÖÜ]]]/
p k2 # /[[ßb-zB-Z]&&[^(?-mix:[aeiouAEIOUäöüÄÖÜ])]]/
Why doesn't it work? What is (?-mix:...)? Is there a way to make this work the way I was expecting it to?
I do things like:
keywords = %w[foo bar]
regex = /\b(?:#{ Regexp.union(keywords).source })\b/i
# => /\b(?:foo|bar)\b/i
That's useful when you want to test for the occurrence of multiple sub-strings inside a single string at once.
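As a quick usage sketch (the sample strings are made up for illustration), the combined pattern can be tested against a string like this:

```ruby
keywords = %w[foo bar]
regex = /\b(?:#{Regexp.union(keywords).source})\b/i

hit  = regex.match?("Walk into a BAR")   # whole word, any case
miss = regex.match?("food and barley")   # only partial words, so \b rejects them

p hit   # true
p miss  # false
```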
Interpolating a regex into a string or another regex won't necessarily work right. By default, when you do that, Ruby converts the pattern using to_s, which is not what I want, because I don't want the full string representation of the pattern, flags and all. Using source returns what I want:
regex = Regexp.union(keywords)
regex # => /foo|bar/
regex.inspect # => "/foo|bar/"
regex.to_s # => "(?-mix:foo|bar)"
regex.source # => "foo|bar"
Use a string to hold those characters and interpolate that into regexes as needed. Ruby is trying to cover some bases with (?-mix:...), but it isn't anticipating that the regex is going into a character set inside the other regex.
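Applied to the question's pattern, interpolating source instead of the regex itself might look like the sketch below. It uses an ASCII-only vowel set and a lowercase alphabet to keep the example short; that simplification is mine, not part of the original code.

```ruby
v = /[aeiou]/                    # ASCII-only stand-in for the vowel set
k = /[[b-z]&&[^#{v.source}]]/    # interpolates "[aeiou]", with no (?-mix:...) wrapper

alphabet = ('a'..'z').to_a.join
result = alphabet.gsub(k, '_')

p k       # /[[b-z]&&[^[aeiou]]]/
p result  # "a___e___i_____o_____u_____"
```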
Background Info
Here's what's really happening:
In many cases, interpolating a regex into a regex makes sense. For example:
a = /abc/ #/abc/
b = /#{a}#{a}/ #/(?-mix:abc)(?-mix:abc)/
'hhhhabcabchthth'.gsub(/abcabc/, '_') # "hhhh_hthth"
'hhhhabcabchthth'.gsub(b, '_') # "hhhh_hthth"
It works as expected. The whole (?-mix: thing is a way of encapsulating the rules for a, just in case b has different flags. a is case sensitive, because this is the default. But if b were set to case insensitive, the only way for a to continue matching what it matched before is to make sure it stays case sensitive using -i. Anything inside (?-i:) after the colon will be matched with case sensitivity. This is made clearer by the following:
e = /a/i # e is made to be case insensitive with the /i
/#{e}/ # /(?i-mx:a)/
You can see above that when interpolating e into something, you now have (?i-mx:). Now the i is to the left of the -, which means it turns case insensitivity on instead of off (temporarily), in order for e to match as it normally would.
Also, in order to avoid messing up the capture order, the (?: part makes this a non-capturing group. All of that is a rough attempt to make the a and e variables match what you expect them to match when you stick them into a larger regex.
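To see the encapsulation at work, here is a small sketch (the names b and c and the test strings are just for illustration) where the outer regex is case insensitive but the interpolated part keeps its own case-sensitive rules:

```ruby
a = /abc/                   # case sensitive by default
b = /#{a}def/i              # a is wrapped, so it keeps its own flags
c = /#{a.source}def/i       # raw source: a's text inherits the outer /i

p b                         # /(?-mix:abc)def/i
p b.match?("abcDEF")        # true:  only the "def" part ignores case
p b.match?("ABCdef")        # false: the interpolated "abc" is still case sensitive
p c.match?("ABCdef")        # true:  without the wrapper, /i applies everywhere
```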
Unfortunately, if you put it inside a character set match, meaning [], this strategy completely fails. Inside brackets, (?-mix:...) is no longer group syntax; each character is read literally, and ?-m is read as a range. So [^(?-mix:...)] excludes, among other things, everything between "?" and "m" (inclusive), which means, for example, that the letter "c" is no longer in your character set, so "c" doesn't get replaced with an underscore, as you see in your example. The same thing happens with the letter "x": it is a literal member of the negated set (it comes from the flag list), so it isn't among the characters being matched and doesn't get replaced either.
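One way to check this reading is to ask which lowercase letters the negated set refuses to match. This is only a sketch with an ASCII-only vowel set; the names broken and excluded are mine.

```ruby
v = /[aeiou]/
broken = /[^#{v}]/     # becomes /[^(?-mix:[aeiou])]/

# Letters that the negated set does NOT match, i.e. letters that ended up inside it.
excluded = ('a'..'z').reject { |ch| broken.match?(ch) }.join

p broken.source   # "[^(?-mix:[aeiou])]"
p excluded        # "abcdefghijklmoux"
```

The letters a through m come from the ?-m range, o and u from the vowels, and x from the flag list, which lines up with the letters left untouched in the question's gsub output.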
Ruby doesn't bother to parse the regular expression to figure out that you're interpolating your regular expression into a character set, and even if it did, it would still have to parse out the v variable to figure out that it is also a character set, and that therefore all you really want is to take the characters from the character set in v and put them with all the other characters there.
My advice is that since aeiouAEIOUäöüÄÖÜ is just a bunch of characters anyway, you can store it in a string and interpolate that into any character set in a regular expression. And be careful about interpolating a regex into a regex in the future. Avoid it unless you are really certain about what it's going to do.
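A minimal sketch of that advice, again with an ASCII-only vowel list for brevity (the variable name vowels and the sample word are my own):

```ruby
vowels = "aeiou"                 # plain string: just the characters, no brackets

v = /[#{vowels}]/                # vowels
k = /[[b-z]&&[^#{vowels}]]/      # consonants: b-z minus the vowels

word = "consonants"
p word.gsub(v, '_')   # "c_ns_n_nts"
p word.gsub(k, '_')   # "_o__o_a___"
```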
Answer I'm using:
If you want to interpolate some_regex into another one, use some_regex.inspect[1...-1] inside the #{}.
For example, taking my original example, this way of defining consonants using an interpolation works.
v = /[aeiouAEIOUäöüÄÖÜ]/ # vowels
k3 = /[[ßb-zB-Z]&&[^#{v.inspect[1...-1]}]]/ # consonants
(I don't know if there's some sort of built-in way to accomplish the same function as .inspect[1...-1] for regexes. I was surprised that that's not already how .to_s works for regexes. I'm still not sure what "(?-mix:some_regex)" is for.)
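For what it's worth, Regexp#source appears to cover this case, and slicing inspect only gives the same text when the regex has no flags. A small sketch comparing the two (the names plain and flagged are just for illustration):

```ruby
plain   = /[aeiou]/
flagged = /[aeiou]/i

p plain.inspect[1...-1]     # "[aeiou]"
p plain.source              # "[aeiou]"

p flagged.inspect           # "/[aeiou]/i"
p flagged.inspect[1...-1]   # "[aeiou]/"   (the trailing slash is left behind)
p flagged.source            # "[aeiou]"
```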
Your statement "k2 is exactly the same except that it's defined using an interpolation" is wrong.
When you interpolate something that is not a string, such as the regex v, it is converted to a string with to_s.
v = /[aeiouAEIOUäöüÄÖÜ]/
v.to_s # => "(?-mix:[aeiouAEIOUäöüÄÖÜ])"
This is interpolated into k2, resulting in a different regex from k1. If you want k2 to be the same as k1, you need to interpolate a string:
v = "[aeiouAEIOUäöüÄÖÜ]"