Regex to match hashtags in a sentence using ruby

I am trying to extract hashtags for a simple college project using Ruby on Rails. I am having trouble with tags that contain only digits and with tags that are not preceded by a space.

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"

The regex I have is /(?:^|\s)#(\w+)/i (source)

This regex returns ["box", "5", "2good", "first"]
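For reference, the behaviour can be reproduced with a short script. This is only a minimal sketch: the constant NAIVE_HASHTAG and the method naive_hashtags are names made up for illustration, and flatten is there because scan returns one array per match when the pattern contains a capture group.

```ruby
# Pattern from the question: '#' preceded by start of line or whitespace.
NAIVE_HASHTAG = /(?:^|\s)#(\w+)/i

# Hypothetical helper, not part of the original post.
def naive_hashtags(text)
  # With one capture group, scan yields [["box"], ["5"], ...], so flatten it.
  text.scan(NAIVE_HASHTAG).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p naive_hashtags(text)
# prints ["box", "5", "2good", "first"]
```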

How can I make sure it returns only ["box", "2good"] and ignores the rest, since they are not 'real' hashtags?

Tags: ruby regex twitter hashtag

3 Answers

ら.Afraid

Reply #2 · 2019-03-13 21:29

Try this:

/\s#([[\d]]?[[a-z]]+\s)/i

Output:

1.9.3-p194 :010 > text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
 => "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second" 
1.9.3-p194 :011 > puts text.scan /\s#([[\d]]?[[a-z]]+\s)/i 
box 
2good 
 => nil
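Note that the pattern above captures the trailing whitespace along with the tag, and it needs a whitespace character on both sides, so a tag at the very start or end of the string is skipped. One possible variant, sketched here as an assumption rather than as part of the original answer, moves the surrounding whitespace into lookarounds so that only the tag is captured. The names LOOKAROUND_HASHTAG and lookaround_hashtags are hypothetical.

```ruby
# Variant of the pattern above: the whitespace on both sides is only asserted
# (lookbehind / lookahead), never consumed or captured.
LOOKAROUND_HASHTAG = /(?<=\s)#(\d?[a-z]+)(?=\s)/i

# Hypothetical helper, not part of the original answer.
def lookaround_hashtags(text)
  text.scan(LOOKAROUND_HASHTAG).flatten
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p lookaround_hashtags(text)
# prints ["box", "2good"]
```

Like the original, this still accepts at most one leading digit followed by letters only, and it still ignores tags at the very start or end of the string.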


叛逆

Reply #3 · 2019-03-13 21:30

Can you try this regex:

/(?:^|\s)(?:(?:#\d+?)|(#\w+?))\s/i

UPDATE 1:
There are a few cases that the above regex will not match, such as #blah23blah and #23blah23, so I modified the regex to take care of all cases.

Regex:

/(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

Breakdown:

(?:\s|^) -- Matches the preceding whitespace or start of line. Does not capture the match.
# -- Matches the hash sign but does not capture it.
(?!\d+(?:\s|$)) -- Negative lookahead that rejects tags consisting ONLY of digits between the # and the next whitespace (or end of line).
(\w+) -- Matches and captures the word characters that make up the tag.
(?=\s|$) -- Positive lookahead that requires whitespace or end of line after the tag. Because it does not consume the whitespace, adjacent valid hashtags still match.

Sample text modified to capture most cases:

#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3

Matches:

Match 1: blah
Match 2: box
Match 3: good2
Match 4: 3good
Match 5: mkvef214asdwq
Match 6: 3e4
Match 7: 2good
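The matches listed above can be checked with a few lines of Ruby. This is a minimal sketch that uses the regex exactly as given. The constant STRICT_HASHTAG and the method strict_hashtags are illustrative names, not something from the answer.

```ruby
# Regex from UPDATE 1, unchanged.
STRICT_HASHTAG = /(?:\s|^)(?:#(?!\d+(?:\s|$)))(\w+)(?=\s|$)/i

# Hypothetical helper: returns the captured tag names without the '#'.
def strict_hashtags(text)
  text.scan(STRICT_HASHTAG).flatten
end

sample = "#blah Pack my #box with #5 dozen #good2 #3good liquor.#jugs " \
         "link.com/liquor#jugs #mkvef214asdwq sd #3e4 flsd #2good #first#second #3"
p strict_hashtags(sample)
# prints ["blah", "box", "good2", "3good", "mkvef214asdwq", "3e4", "2good"]
```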

Rubular link

UPDATE 2:

To exclude words starting or ending with underscore, just include your exclusions in the negative lookahead like this:

/(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

The sample, regex and matches are recorded in this Rubular link
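As a quick check of the underscore rule, the regex from UPDATE 2 can be run against a few tags. This is a sketch only: the input string is made up for illustration, and NO_UNDERSCORE_EDGE_HASHTAG and edge_checked_hashtags are hypothetical names.

```ruby
# Regex from UPDATE 2, unchanged: also rejects tags that start or end with "_".
NO_UNDERSCORE_EDGE_HASHTAG =
  /(?:\s|^)(?:#(?!(?:\d+|\w+?_|_\w+?)(?:\s|$)))(\w+)(?=\s|$)/i

# Hypothetical helper: returns the captured tag names without the '#'.
def edge_checked_hashtags(text)
  text.scan(NO_UNDERSCORE_EDGE_HASHTAG).flatten
end

# Made-up input: leading underscore, trailing underscore, inner underscore,
# digits only, and a plain tag.
p edge_checked_hashtags("#_foo #foo_ #fo_o #5 #ok")
# prints ["fo_o", "ok"]
```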


劫难

Reply #4 · 2019-03-13 21:46

I'd go about it this way:

text.scan(/ #[[:digit:]]?[[:alpha:]]+ /).map{ |s| s.strip[1..-1] }

which returns:

[
    [0] "box",
    [1] "2good"
]

I don't try to do everything in a regex. I prefer to keep patterns as simple as possible, then filter and mutilate once I've captured the basic data. My reasoning is that regexes become harder to maintain the more complex they get. I'd rather spend my time doing something else than maintaining patterns.
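The same "capture loosely, then filter" idea can be pushed a step further by splitting on whitespace and filtering the tokens with two very small patterns. This is a sketch of the general approach, not the code from this answer. The method name filtered_hashtags is hypothetical, and unlike the one-liner above it also accepts digits anywhere in the tag (for example #good2).

```ruby
# Hypothetical helper illustrating split-then-filter.
def filtered_hashtags(text)
  tokens = text.split                                 # whitespace-delimited words
  tags   = tokens.select { |t| t =~ /\A#\w+\z/ }      # whole token is '#' + word chars
  names  = tags.map { |t| t[1..-1] }                  # drop the leading '#'
  names.reject { |n| n =~ /\A\d+\z/ }                 # discard digits-only tags
end

text = "Pack my #box with #5 dozen liquor.#jugs link.com/liquor#jugs #2good #first#second"
p filtered_hashtags(text)
# prints ["box", "2good"]
```

Each step does one thing, so a rule such as "no digits-only tags" can be changed without touching the other patterns.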
