How can I use regex in R to extract Twitter usernames from a string of text?
I've tried
library(stringr)
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)')
But I end up with @foobar, @foo and (@bar, which contains an unwanted parenthesis.
How can I get just @foobar, @foo and @bar as output?
Try using a negative lookbehind so that characters are not consumed in your match:
EDIT: Since it seems lookbehinds don't work in R (I found somewhere here that lookbehinds worked on R, but apparently not...), try this one:
Edit: double escaped the dot
EDITv3... : Try turning on PCRE:
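As a minimal sketch of the lookbehind idea with PCRE turned on, the following uses base R's `gregexpr()` with `perl = TRUE`. The lookbehind reuses the excluded character class from the question's pattern; the variable names `pattern` and `handles` are only illustrative, and this is one possible reading of the suggestion rather than the exact pattern from the answer.

```r
# Sample text from the question.
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

# (?<![-a-zA-Z0-9_]) is a negative lookbehind: the "@" must not be preceded
# by a hyphen, letter, digit or underscore. The preceding character is
# checked but not consumed, so it never ends up in the match.
pattern <- '(?<![-a-zA-Z0-9_])@[A-Za-z]+[A-Za-z0-9_]+'

# perl = TRUE switches gregexpr() to the PCRE engine, which supports lookbehind.
handles <- regmatches(theString, gregexpr(pattern, theString, perl = TRUE))[[1]]
print(handles)
```

Because the character in front of the `@` is only inspected, `(@bar)` yields `@bar` and the address `foo@bar.com` is skipped.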
Here's one method that works in R:
If you want to use @Jerry's answer in R:
Both of these methods include the parenthesis that you don't want, however.
UPDATE: This will get you from start to finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames):
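One way to get there in base R, as a sketch rather than a definitive version: match with the pattern from the question, then strip whatever character was consumed in front of the `@`. The names `raw` and `handles` are illustrative assumptions.

```r
# Sample text and pattern from the question.
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'
pattern <- '(?:^|(?:[^-a-zA-Z0-9_]))@([A-Za-z]+[A-Za-z0-9_]+)'

# Step 1: extract the raw matches. These may carry one leading character
# (for example a space or an opening parenthesis) that the pattern consumed.
raw <- regmatches(theString, gregexpr(pattern, theString, perl = TRUE))[[1]]

# Step 2: drop everything that precedes the "@" in each match.
handles <- sub('^[^@]*', '', raw)
print(handles)
```

The second step only removes the leading delimiter, so underscores inside a username are kept.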
@[a-zA-Z0-9_]{0,15}
Where:
@ matches the character @ literally (case sensitive).
[a-zA-Z0-9_] matches a single character present in the list.
{0,15} is a quantifier that matches between 0 and 15 times, as many times as possible, giving back as needed.
It is working fine for selecting Twitter usernames from a mixed dataset.
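To see how this pattern behaves on the string from the question, here is a small base R sketch (the variable names are illustrative). Note that the pattern has no check on what comes before the `@`, so on this particular input it also picks up the `@bar` inside the email address; whether that matters depends on the dataset.

```r
# Sample text from the question.
theString <- '@foobar Foobar! and @foo (@bar) but not foo@bar.com'

# "@" followed by up to 15 letters, digits or underscores.
pattern <- '@[a-zA-Z0-9_]{0,15}'

handles <- regmatches(theString, gregexpr(pattern, theString, perl = TRUE))[[1]]
print(handles)

# The parenthesis is gone, but the last element comes from "foo@bar.com".
```

If a lone `@` should not count as a match, changing the quantifier to `{1,15}` would require at least one username character.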