Regex to match hashtags in a sentence using Ruby

Posted 2019-03-13 21:08

I am trying to extract hashtags for a simple college project using Ruby on Rails. I am having trouble with tags that consist only of digits and with tags that are not preceded by a space.

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"

The regex I have is /(?:^|\s)#(\w+)/i (source)

This regex returns ["box", "5", "2good", "first"]

How can I make sure it returns only ["box", "2good"] and ignores the rest, since they are not 'real' hashtags?
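The behaviour described in the question can be reproduced with a short script. This is only a sketch: the helper name `scan_tags` and the constant names are made up for illustration. It shows that the pattern places no constraint on what follows the tag and does not look at whether the tag is all digits, which is why "5" and "first" slip through.

```ruby
# Sample sentence and pattern taken from the question.
TEXT = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
ORIGINAL_RE = /(?:^|\s)#(\w+)/i

# String#scan returns one array per match when the pattern has a capture
# group, so flatten the result to get a plain list of tag names.
# (scan_tags is a hypothetical helper, not part of any library.)
def scan_tags(text, pattern)
  text.scan(pattern).flatten
end

p scan_tags(TEXT, ORIGINAL_RE)
# prints ["box", "5", "2good", "first"]
```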

3 answers
ら.Afraid
Post 2 · 2019-03-13 21:29

Try this:

/\s#([[\d]]?[[a-z]]+\s)/i

Output:

1.9.3-p194 :010 > text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
 => "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second" 
1.9.3-p194 :011 > puts text.scan /\s#([[\d]]?[[a-z]]+\s)/i 
box 
2good 
 => nil
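One detail of the pattern above is that the trailing \s sits inside the capture group, so each captured tag carries its trailing whitespace, and a tag at the very start or end of the string is never matched. A possible variant, offered only as a sketch and not as the answer's own code, moves the surrounding whitespace out of the capture. The constant and method names are hypothetical.

```ruby
# Variant of the pattern above: optional leading digit, then letters.
# The leading separator also accepts start of line, and the trailing
# whitespace is moved into a lookahead so it is not captured.
DIGIT_THEN_LETTERS_RE = /(?:^|\s)#(\d?[a-z]+)(?=\s|$)/i

# Hypothetical helper that returns the bare tag names.
def digit_then_letter_tags(text)
  text.scan(DIGIT_THEN_LETTERS_RE).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p digit_then_letter_tags(text)
# prints ["box", "2good"]
```

Like the original, this variant still rejects tags such as #good2 that have digits after the letters.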
叛逆
Post 3 · 2019-03-13 21:30

Can you try this regex:

/(?:^|\s)(?:(?:#\d+?)|(#\w+?))\s/i

UPDATE 1:
The regex above fails to match some valid tags, such as #blah23blah and #23blah23, so I have modified it to cover those cases.

Regex:

/(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

Breakdown:

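  • (?:\s|^) --Matches the preceding whitespace or the start of a line. Does not capture the match.
  • # --Matches the hash sign (inside a non-capturing group), so it is not part of the captured tag.
  • (?!\d+(?:\s|$)) --Negative lookahead that rejects tags made up entirely of digits between the # and the next whitespace (or end of line).
  • (\w+) --Matches and captures the word characters of the tag.
  • (?=\s|$) --Positive lookahead that requires whitespace or end of line after the tag. Because a lookahead does not consume the whitespace, it stays available as the leading separator of the next tag, so adjacent valid hashtags all match.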
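The breakdown above can also be kept next to the pattern itself by using Ruby's x (extended) flag, which ignores whitespace and allows comments inside the regex. The following is a sketch of that idea, not the answer's original code: the names `HASHTAG_RE` and `extract_hashtags` are made up, and the redundant non-capturing group around the hash sign is dropped. In extended mode the literal # has to be escaped, because an unescaped # starts a comment.

```ruby
# Same logic as the regex above, written in extended mode.
HASHTAG_RE = /
  (?:\s|^)           # preceding whitespace or start of line, not captured
  \#                 # the literal hash sign, escaped because of the x flag
  (?!\d+(?:\s|$))    # reject tags made up only of digits
  (\w+)              # capture the tag name
  (?=\s|$)           # must be followed by whitespace or end of line
/xi

# Hypothetical helper that returns the bare tag names.
def extract_hashtags(text)
  text.scan(HASHTAG_RE).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p extract_hashtags(text)
# prints ["box", "2good"]
```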

Sample text modified to capture most cases:

#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3

Matches:

Match 1: blah
Match 2: box
Match 3: good2
Match 4: 3good
Match 5: mkvef214asdwq
Match 6: 3e4
Match 7: 2good

Rubular link

UPDATE 2:

To exclude words starting or ending with underscore, just include your exclusions in the negative lookahead like this:

/(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

The sample, regex and matches are recorded in this Rubular link
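To see the underscore exclusions in action without the Rubular page, the regex from UPDATE 2 can be run against a small string. The sample string and the names below are invented for illustration and are not the sample recorded in the link.

```ruby
# Regex from UPDATE 2, copied unchanged.
NO_EDGE_UNDERSCORE_RE = /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

# Hypothetical helper that returns the bare tag names.
def tags_without_edge_underscores(text)
  text.scan(NO_EDGE_UNDERSCORE_RE).flatten
end

# Hypothetical sample: leading underscore, trailing underscore,
# underscore in the middle, digits only, and a plain tag.
sample = "#_lead #trail_ #mid_dle #42 #ok"
p tags_without_edge_underscores(sample)
# prints ["mid_dle", "ok"]
```

An underscore in the middle of a tag is still accepted; only tags that start or end with one are dropped, along with all-digit tags.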

劫难
Post 4 · 2019-03-13 21:46

I'd go about it this way:

text.scan(/ #[[:digit:]]?[[:alpha:]]+ /).map{ |s| s.strip[1..-1] }

which returns:

[
    [0] "box",
    [1] "2good"
]

I don't try to do everything in a regex. I prefer to keep patterns as simple as possible, then filter and mutilate once I've captured the basic data. My reasoning is that regexes become harder to maintain as they grow more complex. I'd rather spend my time doing something else than maintaining patterns.
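The same "capture loosely, then filter" idea can be pushed a little further. This is a sketch under assumptions: the filtering rule here (drop tags made up only of digits) is borrowed from the other answers, and `LOOSE_TAG_RE` and `real_hashtags` are hypothetical names.

```ruby
# Step 1: a deliberately loose pattern that grabs anything shaped like a
# standalone tag (whitespace or line boundary on both sides).
LOOSE_TAG_RE = /(?:^|\s)#(\w+)(?=\s|$)/

# Step 2: plain Ruby filtering, which is easier to read and to change
# later than another lookahead inside the pattern.
def real_hashtags(text)
  text.scan(LOOSE_TAG_RE).flatten.reject { |tag| tag =~ /\A\d+\z/ }
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p real_hashtags(text)
# prints ["box", "2good"]
```

Further rules, such as a minimum length or a blocklist, can then be added as extra reject or select steps without touching the regex.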
