How to pass Regexp.last_match to a block in Ruby

Posted 2019-07-01 09:46

Question:

Is there any way to pass the last match (practically Regexp.last_match) to a block (iterator) in Ruby?

Here is a sample method, a kind of wrapper around String#sub, to demonstrate the problem. It accepts both the standard arguments and a block:

def newsub(str, *rest, &bloc)
  str.sub(*rest, &bloc)
end

It works in the standard arguments-only case and it can take a block; however, the positional special variables such as $1 and $2 are not usable inside the block. Here are some examples:

newsub("abcd", /ab(c)/, '\1')        # => "cd"
newsub("abcd", /ab(c)/){|m| $1}      # => "d"  ($1 == nil)
newsub("abcd", /ab(c)/){$1.upcase}   # => NoMethodError

I suppose the reason the block does not work in the same way as String#sub(/..(.)/){$1} has something to do with scope: the special variables $1, $2, etc. are local variables (and so is Regexp.last_match).

Is there any way to solve this? I would like to make the method newsub work just as String#sub does, in the sense that $1, $2, etc. are usable in the supplied block.

EDIT: According to some past answers, there may not be a way to achieve this…

Answer 1:

Here is a way to do what the question asks (Ruby 2). It is not pretty and not quite perfect in every respect, but it does the job.

def newsub(str, *rest, &bloc)
  str =~ rest[0]  # => ArgumentError if rest[0].nil?
  bloc.binding.tap do |b|
    b.local_variable_set(:_, $~)
    b.eval("$~=_")
  end if bloc
  str.sub(*rest, &bloc)
end

With this, the result is as follows:

_ = (/(xyz)/ =~ 'xyz')
p $1  # => "xyz"
p _   # => 0

p newsub("abcd", /ab(c)/, '\1')        # => "cd"
p $1  # => "xyz"
p _   # => 0

p newsub("abcd", /ab(c)/){|m| $1}      # => "cd"
p $1  # => "c"
p _                 # => #<MatchData "abc" 1:"c">

v, _ = $1, newsub("efg", /ef(g)/){$1.upcase}
p [v, _]  # => ["c", "G"]
p $1  # => "g"
p Regexp.last_match # => #<MatchData "efg" 1:"g">

In-depth analysis

In the method newsub defined above, when a block is given, the caller's local variables $1 etc. are (re)set and remain so after the block has been executed, which is consistent with String#sub. However, when no block is given, the caller's $1 etc. are not reset, whereas String#sub always resets them regardless of whether a block is given.

Also, the caller's local variable _ is overwritten by this algorithm. By Ruby convention, the local variable _ is a dummy variable whose value should not be read or referred to, so this should not cause any practical problems. If the statement local_variable_set(:$~, $~) were valid, no temporary local variable would be needed. However, it is not valid in Ruby (as of version 2.5.1 at least). See the comment (in Japanese) by Kazuhiro NISHIYAMA in [ruby-list:50708].
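If changing the block's interface is acceptable, a simpler alternative is to hand the MatchData to the block as an ordinary argument instead of relying on $1. The following is only a minimal sketch and not the approach described above: the method name newsub_with_match is hypothetical, it takes a single pattern argument, and $1 still cannot be used inside the caller's block.

```ruby
# Hypothetical helper (not part of the answer above): yields the MatchData
# of the substitution to the caller's block.
def newsub_with_match(str, pattern)
  # The inner block is written inside this method, so the $~ that String#sub
  # sets is visible here and can be forwarded as a normal block argument.
  str.sub(pattern) { yield $~ }
end

p newsub_with_match("abcd", /ab(c)/) { |m| m[1].upcase }  # => "Cd"
```

This leaves the caller's _ and $~ untouched, at the cost of a block signature that differs from that of String#sub.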

General background (Ruby's specification) explained

Here is a simple example to highlight Ruby's specification related to this issue:

s = "abcd"
/b(c)/ =~ s
p $1     # => "c"
1.times do |i|
  p s    # => "abcd"
  p $1   # => "c"
end

The special variables $&, $1, $2, etc. (and the related $~ (Regexp.last_match), $' and the like) work in the local scope. In Ruby, a block can see the local variables of the scope in which it is written. In the example above, the variable s is visible inside the block, and so is $1. The do block is yielded to by 1.times, and the method 1.times has no control over the variables inside the block except for the block parameters (i in the example above, which receives the iteration index from Integer#times).

This means a method that yields to a block has no control over $1, $2, etc. in the block, which are local variables (even though they may look like global variables).
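The two points above can be checked with a small sketch. All method names here are made up for illustration. The first pair shows that a match performed inside a method leaves the caller's $1 alone; the second shows that a method written in Ruby cannot change the $1 seen by a block it yields to.

```ruby
# A match inside a method sets only that method's own $~ and $1.
def match_in_method
  /(\d+)/ =~ "id 42"
  $1
end

def caller_keeps_its_match
  /(\w+)/ =~ "hello"
  inner = match_in_method   # matches something else internally
  [inner, $1]               # the caller's $1 is unaffected
end

# A method written in Ruby that yields cannot set $1 for the block.
def yield_after_match
  /(inner)/ =~ "inner"      # sets $1 only within yield_after_match
  yield
end

def block_sees_callers_match
  /(outer)/ =~ "outer"
  yield_after_match { $1 }  # the block reads $1 of the scope it is written in
end

p caller_keeps_its_match    # => ["42", "hello"]
p block_sees_callers_match  # => "outer"
```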

Case of String#sub

Now, let us analyse how String#sub with the block works:

'abc'.sub(/.(.)./){ |m| $1 }

Here, the method sub first performs a Regexp match, and hence the local variables like $1 are automatically set. The block then sees them because sub, being a built-in method, sets them in the scope of its caller, which is also the scope of the block. They are not passed from sub to the block, unlike the block parameter m (which is the matched String, equivalent to $&).

For that reason, if a method like sub is defined in a different scope from the block, it has no control over the local variables inside the block, including $1. A different scope here means that the method is written and defined in Ruby code; in practice, that covers all Ruby methods except some of those written not in Ruby but in the language used to implement the Ruby interpreter.
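As a small illustration of the built-in case, the sketch below (the method name swap_range is made up) calls String#sub directly with a block. Inside the block, the parameter m equals $&, and $1 and $2 are available because the block is written in the same scope as the call to sub.

```ruby
def swap_range(text)
  text.sub(/(\d+)-(\d+)/) do |m|
    # m is the matched String, the same value as $&.
    raise "m differs from $&" unless m == $&
    "#{$2}-#{$1}"   # $1 and $2 are set by the built-in sub
  end
end

p swap_range("range 10-20")  # => "range 20-10"
```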

Ruby's official documentation (version 2.5.1) says in the section on String#sub:

In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately.

Correct. In practice, the methods that can and do set the Regexp-match-related special variables such as $1 and $2 are limited to some built-in methods, including Regexp#match, Regexp#=~, Regexp#===, String#=~, String#sub, String#gsub, String#scan, Enumerable#all?, and Enumerable#grep.
Tip 1: String#split seems to always reset $~ to nil.
Tip 2: Regexp#match? and String#match? do not update $~ and hence are much faster.
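Tip 2 can be checked with a short sketch (the method name is made up for illustration): after a successful =~, a later call to match? returns true but leaves $~, and therefore $1, as they were.

```ruby
def last_match_after_match_p
  /(a)/ =~ "a"                 # sets $~, so $1 == "a"
  matched = "b".match?(/(b)/)  # true, but $~ is left untouched
  [matched, $1]
end

p last_match_after_match_p  # => [true, "a"]
```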

Here is a little code snippet to highlight how the scope works:

def sample(str, *rest, &bloc)
  str.sub(*rest, &bloc)
  $1    # non-nil if matches
end

sample('abc', /(c)/){}  # => "c"
p $1    # => nil

Here, $1 in the method sample() is set by str.sub in the same scope. That implies the method sample() cannot (simply) access the $1 seen inside the block given to it.

I would point out that the following statement in the Regular expression section of Ruby's official documentation (version 2.5.1)

Using =~ operator with a String and Regexp the $~ global variable is set after a successful match.

is rather misleading, because

  1. $~ is a pre-defined local-scope variable (not global variable), and
  2. $~ is set (maybe nil) regardless of whether the last attempted match is successful or not.
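The second point can be seen in a minimal sketch (the method name is made up for illustration): a failed match does not leave the previous MatchData in place but sets $~ to nil.

```ruby
def last_match_after_failure
  /(a)/ =~ "a"            # successful match
  before = $~             # a MatchData
  result = /(z)/ =~ "a"   # failed match returns nil
  [before.class, result, $~]
end

p last_match_after_failure  # => [MatchData, nil, nil]
```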

The fact that variables like $~ and $1 are not global variables may be slightly confusing. But hey, they are useful notations, aren't they?