I want to change different characters/substrings to a single character or nil. For example, I want to change "How to chop an onion?" to "how-chop-onion".
string
.gsub(/'s/,'')
.gsub(/[?&]/,'')
.gsub('to|an|a|the','')
.split(' ')
.map { |s| s.downcase}
.join '-'
Using the pipe character | does not work. How can I do this with gsub?
to|an|a|the is a pattern, but you are passing it as a String, so gsub searches for that literal text. Compare:
str.gsub('to|an|a|the', '') # passing string argument
#=> "How to chop an onion?"
str.gsub(/to|an|a|the/, '') # passing pattern argument
#=> "How  chop  onion?" (the spaces on either side of each removed word remain)
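To make the difference concrete, here is a small sketch. A String argument is matched literally, so it only matches text that really contains the pipe characters. `Regexp.union` is one way to build the alternation from a list of words; the variable names are illustrative and not from the original post.

```ruby
str = "How to chop an onion?"

# A String pattern is literal: it matches only the exact text "to|an|a|the".
literal_hit = "to|an|a|the".gsub('to|an|a|the', 'X')

# The sentence does not contain that literal text, so nothing is replaced.
unchanged = str.gsub('to|an|a|the', '')

# Regexp.union builds the alternation /to|an|a|the/ from the word list.
words   = %w[to an a the]
pattern = Regexp.union(words)
removed = str.gsub(pattern, '')

p literal_hit       # "X"
p unchanged == str  # true
p pattern.source    # "to|an|a|the"
p removed           # words removed, surrounding spaces left in place
```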
▶ "How to chop an onion?".gsub(/'s|[?&]+|to|an|a|the/,'')
.downcase.split(/\s+/).join '-'
#⇒ "how-chop-onion"
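One caveat with a bare alternation is that it also matches inside longer words. As a hedged sketch, the same idea can be wrapped in a helper that only removes whole words. The method name `slugify` and its `stop_words` parameter are hypothetical, introduced here for illustration.

```ruby
# Hypothetical helper: slugify and stop_words are illustrative names only.
def slugify(title, stop_words: %w[to an a the])
  # \b on both sides restricts the match to whole words,
  # so the "an" and "a" inside "banana" are left alone.
  whole_words = /\b(?:#{Regexp.union(stop_words).source})\b/

  title.downcase                  # lower case first, so the regexes stay simple
       .gsub(/'s|[?&,]+/, '')     # drop possessives and punctuation
       .gsub(whole_words, '')     # drop the stop words
       .split                     # split on runs of whitespace (removes extra spaces)
       .join('-')
end

puts slugify("How to chop an onion?")  # how-chop-onion
puts slugify("How to peel a banana?")  # how-peel-banana
```

Calling `split` with no argument splits on any run of whitespace, so the gaps left by removed words do not produce empty segments.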
Start by making a list of what you want to do:
- Remove certain words
- Remove certain punctuation
- Remove extra spaces after words are removed
- Convert to lower case¹
Now think about the order in which these operations should be performed. The conversion to lower case can be done at any time, but it's convenient to do it first, so that the regex need not be case-insensitive. Punctuation should be removed before the words, which makes it easier to identify whole words as opposed to substrings. Removing the extra spaces must obviously be done after the words are removed. We therefore want the order to be:
- Convert to lower case
- Remove certain punctuation
- Remove certain words
- Remove extra spaces after words are removed
After down-casing, this could be done with three chained gsubs:
str = "Please, don't any of you know how to chop an avacado?"
r1 = /[,?]/ # match a comma or question mark
r2 = /
  \b             # match a word break
  (?:            # start a non-capture group
    to|an|a|the  # match one of these words (checking left to right)
  )              # end non-capture group
  \b             # match a word break
/x               # extended/free-spacing regex definition mode
r3 = /\s\s/ # match two whitespace characters
str.downcase.gsub(r1,'').gsub(r2,'').gsub(r3,' ')
#=> "please don't any of you know how chop avacado"
Note that without the word breaks (\b) in r2 we would get:
"plese don't y of you know how chop vcdo"
Also, the first gsub could be replaced by:
tr(',?','')
or:
delete(',?')
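As a quick check, the three forms can be compared side by side. This is only a sketch and the variable names are illustrative.

```ruby
s = "please, don't any of you know how to chop an avacado?"

via_gsub   = s.gsub(/[,?]/, '')  # regex character class
via_tr     = s.tr(',?', '')      # an empty replacement set deletes the characters
via_delete = s.delete(',?')      # deletes every character in the given set

p via_gsub
p via_gsub == via_tr && via_tr == via_delete  # true
```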
These gsubs can be combined into one (how I'd write it), as follows:
r = /
  [,?]                    # as in r1
  |                       # or
  \b(?:to|an|a|the)\b\s*  # as in r2, plus any whitespace after the word
  |                       # or
  \s                      # match a whitespace char...
  (?=\s)                  # ...that is followed by another (positive lookahead)
/x
str.downcase.gsub(r,'')
#=> "please don't any of you know how chop avacado"
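The `\s(?=\s)` alternative can also be tried on its own. The following sketch uses a made-up input string with runs of two and three spaces to show that only the first character of each pair is matched and removed.

```ruby
# Illustrative input: two spaces after "how", three after "chop".
s = "how" + " " * 2 + "chop" + " " * 3 + "avacado"

m = s.match(/\s(?=\s)/)
p m[0]         # " "  (the space checked by the lookahead is not part of the match)
p m[0].length  # 1

# Every whitespace char that is followed by another is removed,
# which leaves exactly one from each run.
collapsed = s.gsub(/\s(?=\s)/, '')
p collapsed    # "how chop avacado"
```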
"Lookarounds" (here a positive lookahead) are often referred to as "zero-width", meaning that, while the match is required, they do not form part of the match that is returned.
¹ Have you ever wondered where the terms "lower case" and "upper case" came from? In the early days of printing, typesetters kept the metal movable type in two cases, one located above the other. Those for the taller letters, used to begin sentences and proper nouns, were in the upper case; the remaining ones were in the lower case.