I have a CSV file that looks something like this:
text
RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://…
#CRISPR Inversion of CTCF Sites Alters Genome Topology & Enhancer/Promoter Function in @CellCellPress htp://.co/HrjDwbm7NN
RT @gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail
RT @sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology
RT @MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via @nucAmbiguous htp://…
I want to extract all the mentions (starting with '@') from the tweet text. So far I have done this:
import pandas as pd
import re
mydata = pd.read_csv("C:/Users/file.csv")
X = mydata.ix[:,:]
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text'
for i in range(X.shape[0]):
    result = re.findall("(^|[^@\w])@(\w{1,25})", str(X.iloc[:i,:]))
print(result);
There are two problems here.
First, str(X.iloc[:1,:]) gives me ['CritCareMed'], which is wrong: it should give me ['CellCellPress']. str(X.iloc[:2,:]) again gives me ['CritCareMed'], which is also wrong.
Second, the final result I'm getting is
[(' ', 'CritCareMed'), (' ', 'gvwilson'), (' ', 'sciencemagazine')]
It doesn't include the mention in the second row or either of the two mentions in the last row. What I want should look something like this:
How can I achieve these results? This is just sample data; my original data has lots of tweets, so is this approach OK?
Same as this: Extract hashtags from columns of a pandas dataframe, but for mentions.
@.*? carries out a non-greedy match for a word starting with an @
(?=\s|$) is a look-ahead for the end of the word or the end of the sentence
(?:(?<=\s)|(?<=^)) is a look-behind to ensure there are no false positives if an @ is used in the middle of a word
The regex look-behind asserts that either a space or the start of the sentence must precede an @ character.
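To see how those three pieces behave together, here is a minimal sketch using only the standard `re` module on a few shortened rows. The third row is a made-up example (not from the question's data) added to show the mid-word case. The same pattern string can be passed to pandas' `Series.str.findall`.

```python
import re

# Look-behind: '@' must follow whitespace or the start of the string.
# Look-ahead: the match stops at the next whitespace or the end of the string.
MENTION = re.compile(r"(?:(?<=\s)|(?<=^))@.*?(?=\s|$)")

tweets = [
    "RT @gvwilson: Where's the theory for software engineering?",
    "Function in @CellCellPress htp://.co/HrjDwbm7NN",
    "mail me at someone@example.com",  # hypothetical row: '@' in the middle of a word
]

# One list of matches per tweet.
mentions = [MENTION.findall(t) for t in tweets]
print(mentions)
```

Note that because `.*?` runs up to the next whitespace, trailing punctuation stays in the match (the first row gives `@gvwilson:` with the colon), so you may want to strip that afterwards.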
While you already have your answer, you could even try to optimize the whole import process like so:
Which yields:
This might be a bit faster, as you don't need to change the df once it's already constructed.
You can use the str.findall method to avoid the for loop, and use a negative look-behind to replace (^|[^@\w]), which forms another capture group you don't need in your regex.
Also, X.iloc[:i,:] gives back a data frame, so str(X.iloc[:i,:]) gives you the string representation of a data frame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0] or, as a better way to iterate through a column, use iteritems (renamed to items in pandas 2.0 and later):
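A rough sketch of both ideas, assuming a small hand-built frame in place of the CSV (the rows are shortened or invented for illustration, and the `mentions` column name is my own choice). The pattern is one possible negative look-behind version of the question's regex, not necessarily the exact one originally posted.

```python
import pandas as pd

# Hypothetical frame standing in for the CSV; only the 'text' column matters.
df = pd.DataFrame({"text": [
    "RT @gvwilson: Where's the theory? #semat #fail",
    "worm connectome gazers. via @nucAmbiguous and @MHendr1cks",
    "no mentions here, just someone@example.com",
]})

# Negative look-behind instead of the (^|[^@\w]) group: with a single
# capture group, findall returns plain strings rather than tuples.
pattern = r"(?<![@\w])@(\w{1,25})"

# Vectorised: one list of mentions per row, no explicit loop.
df["mentions"] = df["text"].str.findall(pattern)

# A single cell is a plain str, unlike str(df.iloc[:1, :]).
first = df["text"].iloc[0]

# Explicit iteration over the column, one cell at a time.
# Series.items() is the current name; older pandas called it iteritems().
for idx, tweet in df["text"].items():
    print(idx, df.at[idx, "mentions"])
```

The vectorised `str.findall` line is usually all you need; the loop is only there to show what iterating over the actual cell values looks like.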