I need to preprocess tweets using Python. What regular expression would remove all the hashtags, @user mentions, and links from a tweet?
For example:
- original tweet:
@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
- processed tweet:
I really love that shirt at Macy
- original tweet:
@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
- processed tweet:
Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
- original tweet:
I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
- processed tweet:
I am at Starbucks 7419 3rd ave at 75th Brooklyn
I just need the meaningful words in each tweet. I don't need the username, any links, or any punctuation.
I know it's not a regex but:
This will work with your examples. If you have links inside your tweets, it will fail, miserably.
Edit: works with internal links too, as long as they are separated by a space.
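A minimal sketch of this kind of non-regex filtering, assuming tokens are separated by whitespace, mentions start with `@`, and links contain `://`. The name `clean_tweet_no_regex` is made up for illustration.

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_tweet_no_regex(text):
    # Assumption: a mention is any token starting with "@",
    # and a link is any token containing "://".
    kept = [t for t in text.split() if not t.startswith("@") and "://" not in t]
    # Turn remaining punctuation (including "#") into spaces, then collapse whitespace.
    return " ".join(" ".join(kept).translate(PUNCT_TO_SPACE).split())

print(clean_tweet_no_regex("@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"))
# I really love that shirt at Macy
```

A link glued to the preceding word without a space (for example `shirt,http://...`) would take that word with it, which is the "separated by a space" caveat.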
Just go with the API. Why reinvent the wheel?
A little bit late, but this solution prevents punctuation mistakes like #hashtag1,#hashtag2 (without spaces), and the implementation is very simple.
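One way to get that behaviour, as a rough sketch: turn every punctuation mark except the entity prefixes into a space first, so that adjacent entities become separate tokens, and only then drop the tokens that start with `@` or `#`. The function name, the choice to drop hashtags entirely, and the `://` test for links are assumptions made for illustration.

```python
import string

ENTITY_PREFIXES = ("@", "#")

def strip_entities(text):
    # Assumption: a link is any whitespace-separated token containing "://".
    # Links go first, otherwise the punctuation step would shred them.
    text = " ".join(t for t in text.split() if "://" not in t)
    for ch in string.punctuation:
        # Keep "@" and "#" so entities stay recognisable, and "_" so that
        # names like @some_user are not split in two.
        if ch not in ENTITY_PREFIXES and ch != "_":
            text = text.replace(ch, " ")
    # Drop every token that is a mention or a hashtag.
    return " ".join(w for w in text.split() if not w.startswith(ENTITY_PREFIXES))

print(strip_entities("great day #sun,#beach with @ann,@bob!"))
# great day with
```

Note that this drops the hashtag word as well, so `#Macy` disappears instead of becoming `Macy`. If you want to keep the word, strip the `#` character instead of dropping the token.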
The following example is a close approximation. Unfortunately, there is no right way to do it with a regular expression alone. The following regex strips out any URL (not just http), any punctuation, user names, and any other non-alphanumeric characters. It also separates the words with a single space. If you want to parse tweets the way you intend, you need more intelligence in the system, such as some self-learning algorithm, since there is no standard tweet format.
Here is what I am proposing.
and here is the result on your examples
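A minimal sketch of such a regex, run against the three tweets from the question. The name `clean_tweet` and the exact pattern are illustrative choices (in particular `@\w+` for user names and `\w+://\S+` for anything with a URL scheme), so adjust them to your data.

```python
import re

# Three alternatives, tried left to right at each position:
#   @\w+              a user mention
#   \w+://\S+         a link with any scheme (http, https, ftp, ...)
#   [^0-9A-Za-z \t]   any single character that is not ASCII alphanumeric, space, or tab
NOISE = re.compile(r"(@\w+)|(\w+://\S+)|([^0-9A-Za-z \t])")

def clean_tweet(text):
    # Replace every match with a space, then collapse runs of whitespace.
    return " ".join(NOISE.sub(" ", text).split())

tweets = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: "
    "Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]
for tweet in tweets:
    print(clean_tweet(tweet))
# I really love that shirt at Macy
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
```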
and here are a few examples where it is not perfect
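Some inputs where a pattern of the kind described above falls short, shown with a hypothetical `clean_tweet` helper (the pattern is an illustrative assumption, not a definitive one). Contractions get split, links without a scheme leave fragments behind, e-mail addresses are mangled, and non-ASCII letters are dropped.

```python
import re

# Illustrative pattern: mentions, scheme-prefixed links, then any other
# character that is not ASCII alphanumeric, space, or tab.
NOISE = re.compile(r"(@\w+)|(\w+://\S+)|([^0-9A-Za-z \t])")

def clean_tweet(text):
    return " ".join(NOISE.sub(" ", text).split())

# The apostrophe splits the contraction; the scheme-less link is only broken at the dots.
print(clean_tweet("I don't like www.example.com"))
# I don t like www example com

# "@example" is treated as a mention, so the address falls apart.
print(clean_tweet("mail me at bob@example.com"))
# mail me at bob com

# Accented letters are outside A-Za-z and get removed.
print(clean_tweet("Great caf\u00e9"))
# Great caf
```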