Get Twitter @Username with Regex in R

Posted 2020-01-29 16:33

How can I use regex in R to extract Twitter usernames from a string of text?

I've tried

library(stringr)

theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')

But I end up with @foobar, @foo and (@bar which contains an unwanted parenthesis.

How can I get just @foobar, @foo and @bar as output?

Tags: regex r twitter
3 answers
Ridiculous、
#2 · 2020-01-29 16:55

Try using a negative lookbehind so that characters are not consumed in your match:

(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)
      ^^^

EDIT: It seems lookbehinds don't work with the regex engine being used here (I read somewhere that lookbehinds work in R, but apparently not), so try this one instead:

@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)

Edit: double-escaped the dot, since the pattern sits inside an R string.

EDIT 3: Try turning on PCRE:

str_extract_all(string=theString, pattern=perl("(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"))
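For comparison, here is a minimal base R sketch of the same idea that does not depend on stringr. It assumes `perl = TRUE` is passed to `gregexpr()` so that the PCRE engine (which supports lookbehind) is used. `extract_all` is a hypothetical helper name introduced only for this illustration. The first pattern is the one from the question, the second is the lookbehind version above.

```r
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

# Pattern from the question: the character before "@" is consumed by the match.
consuming  <- "(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"
# Lookbehind version: the previous character is only checked, not consumed.
lookbehind <- "(?:^|(?<![-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)"

# Hypothetical helper: return every full match of `pattern` in `text` (PCRE).
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

with_prefix <- extract_all(consuming, theString)
users       <- extract_all(lookbehind, theString)

print(with_prefix)  # "@foobar" " @foo" "(@bar"
print(users)        # "@foobar" "@foo" "@bar"
```

The email address is skipped in both cases because the "@" in foo@bar.com is preceded by a word character.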
时光不老,我们不散
#3 · 2020-01-29 17:04

Here's one method that works in R:

theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^@\\w])@(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)"

If you want to use @Jerry's answer in R:

regex <- "@([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = TRUE)
theString1[idx]
[1] "@foobar" "@foo"    "(@bar)" 

Both of these methods include the parenthesis that you don't want, however.

UPDATE: This will get you from start to finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames):

theString <- '@foobar Foobar! and @fo_o (@bar) but not foo@bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b" # get strings with @
regex2 <- "[^[:alnum:]@_]"             # remove all punctuation except _ and @
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = TRUE)])
users

[1] "@foobar" "@fo_o"   "@bar"
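The clean-up step is needed because grep() returns whole elements rather than the matched text. As a possible alternative, the second capture group of the same regex can be pulled out directly with sub(). This is only a sketch: the variable names `tokens` and `hits` are introduced here for illustration, and it assumes each token holds at most one username.

```r
theString <- '@foobar Foobar! and @fo_o (@bar) but not foo@bar.com'
tokens <- unlist(strsplit(theString, " "))

# Same detection regex as above; group 2 captures the name without the "@".
regex1 <- "(^|[^@\\w])@(\\w{1,15})\\b"

# value = TRUE returns the matching tokens themselves instead of their indices.
hits <- grep(regex1, tokens, perl = TRUE, value = TRUE)

# Match the whole token around the username and keep only "@" plus group 2.
users <- sub(paste0("^.*?", regex1, ".*$"), "@\\2", hits, perl = TRUE)
print(users)  # "@foobar" "@fo_o" "@bar"
```

One difference from stripping punctuation: a token such as "@foo's" becomes "@foos" with the gsub() approach, while the capture-group version keeps just "@foo".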
Animai°情兽
#4 · 2020-01-29 17:13

@[a-zA-Z0-9_]{0,15}

Where:

  • @ matches the character @ literally (case sensitive).

  • [a-zA-Z0-9_] matches a single character from the set: a letter, a digit, or an underscore

  • {0,15} Quantifier matches between 0 and 15 times, as many times as possible, giving back as needed

It works fine for selecting Twitter usernames from a mixed dataset.
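A quick check of this pattern against the question's sample string, sketched in base R. `extract_all` is a hypothetical helper, and the `guarded` pattern is an assumption added for comparison rather than part of this answer. Because nothing constrains the character before "@", the plain pattern also picks up the domain part of the email address, and `{0,15}` lets a bare "@" match on its own.

```r
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

# Hypothetical helper: return every full match of `pattern` in `text` (PCRE).
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Pattern exactly as given in this answer.
plain <- extract_all("@[a-zA-Z0-9_]{0,15}", theString)

# Assumed variant: require at least one name character and reject an "@"
# that directly follows a letter, digit, or underscore.
guarded <- extract_all("(?<![a-zA-Z0-9_])@[a-zA-Z0-9_]{1,15}", theString)

# A bare "@" also matches the plain pattern because of the {0,15} quantifier.
bare <- extract_all("@[a-zA-Z0-9_]{0,15}", "mail me @ home")

print(plain)    # "@foobar" "@foo" "@bar" "@bar"
print(guarded)  # "@foobar" "@foo" "@bar"
print(bare)     # "@"
```

Whether the extra guard is needed depends on the data. If the text never contains email addresses or stray "@" signs, the shorter pattern may be enough.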
