how to remove hashtag, @user, link of a tweet usin

I need to preprocess tweets using Python. What regular expression would remove all the hashtags, @user mentions, and links from a tweet?

For example:

original tweet: @peter I really love that shirt at #Macy. http://bet.ly//WjdiW4
- processed tweet: I really love that shirt at Macy
original tweet: @shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx
- processed tweet: Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
original tweet: I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)
- processed tweet: I am at Starbucks 7419 3rd ave at 75th Brooklyn

I just need the meaningful words in each tweet. I don't need the username, links, or punctuation.

Tags: python regex twitter

4 answers

Juvenile、少年°

Reply #2 · 2019-03-08 09:29

I know it's not a regex but:

>>>
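>>> from urllib.parse import urlparse  # Python 3; in Python 2 this was "import urlparse"
>>> string = '@peter I really love that shirt at #Macy. http://bit.ly//WjdiW#'
>>> new_string = ''
>>> for i in string.split():
...     s, n, p, pa, q, f = urlparse(i)  # scheme, netloc, path, params, query, fragment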
...     if s and n:
...         pass
...     elif i[:1] == '@':
...         pass
...     elif i[:1] == '#':
...         new_string = new_string.strip() + ' ' + i[1:]
...     else:
...         new_string = new_string.strip() + ' ' + i
...
>>> new_string
'I really love that shirt at Macy.'
>>>
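
The same token-by-token idea can be wrapped in a reusable function. This is only a sketch, assuming Python 3; the name `clean_tweet` is made up for illustration and is not part of the answer above.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse


def clean_tweet(tweet):
    """Drop @mentions and links; keep hashtag words without the '#'."""
    kept = []
    for token in tweet.split():
        parts = urlparse(token)
        if parts.scheme and parts.netloc:
            continue  # has both a scheme and a host, so treat it as a URL
        if token.startswith('@'):
            continue  # user mention
        if token.startswith('#'):
            token = token[1:]  # keep the word, drop the marker
        if token:
            kept.append(token)
    return ' '.join(kept)


print(clean_tweet('@peter I really love that shirt at #Macy. http://bit.ly//WjdiW#'))
```

Like the interactive version, this leaves punctuation in place, so "Macy." keeps its trailing period.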


啃猪蹄的小仙女

Reply #3 · 2019-03-08 09:37

~~This will work with your examples. If you have links inside your tweets, it will fail, miserably.~~

result = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", subject)
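
For reference, here is a self-contained run of that pattern on the first example tweet. The wrapper name `remove_entities` and the whitespace clean-up step are additions for illustration, not part of the one-liner above.

```python
import re

# Pattern from the one-liner above: mentions, hashtags, and anything that
# starts with "http" and is followed somewhere later by "://".
ENTITY_RE = re.compile(r"(?:@\S*|#\S*|http(?=.*://)\S*)")


def remove_entities(subject):
    stripped = ENTITY_RE.sub("", subject)
    # Collapse the gaps left behind by the removed tokens.
    return " ".join(stripped.split())


print(remove_entities("@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4"))
```

Note that `#\S*` removes the whole hashtag token, so "Macy." disappears along with the "#", and punctuation elsewhere in the tweet is left untouched.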

Edit:

~~works with internal links too, as long as they are separated by a space.~~

Just go with the API. Why reinvent the wheel?
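
The answer does not say which API it means. Assuming it refers to the entity metadata that Twitter's REST API v1.1 attaches to each tweet (an `entities` object whose items carry `indices` offsets into the text), a rough sketch could look like this. The payload below is hand-built for illustration and `strip_entities` is a hypothetical helper, not a library function.

```python
def strip_entities(tweet):
    """Remove mention and URL spans; replace hashtag spans with the bare word."""
    text = tweet["text"]
    entities = tweet.get("entities", {})
    edits = []  # (start, end, replacement)
    for entity in entities.get("user_mentions", []):
        start, end = entity["indices"]
        edits.append((start, end, ""))
    for entity in entities.get("urls", []):
        start, end = entity["indices"]
        edits.append((start, end, ""))
    for entity in entities.get("hashtags", []):
        start, end = entity["indices"]
        edits.append((start, end, entity["text"]))  # keep the word, drop '#'
    # Apply from right to left so earlier offsets stay valid.
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return " ".join(text.split())


# Hypothetical payload, shaped like a v1.1 tweet object (assumption).
sample = {
    "text": "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "entities": {
        "user_mentions": [{"screen_name": "peter", "indices": [0, 6]}],
        "hashtags": [{"text": "Macy", "indices": [35, 40]}],
        "urls": [{"url": "http://bet.ly//WjdiW4", "indices": [42, 63]}],
    },
}
print(strip_entities(sample))
```

Because the offsets come from the service rather than from pattern matching, this avoids guessing where a hashtag or link ends. Punctuation still has to be removed separately.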


傲

Reply #4 · 2019-03-08 09:39

A little bit late, but this solution prevents punctuation mistakes like #hashtag1,#hashtag2 (without spaces), and the implementation is very simple:

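import re
import string

def strip_links(text):
    # Raw string so the backslash escapes reach the regex engine unchanged.
    link_regex = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)', re.DOTALL)
    links = re.findall(link_regex, text)
    for link in links:
        # findall returns one tuple per match; element 0 is the whole URL.
        text = text.replace(link[0], ', ')
    return text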

def strip_all_entities(text):
    entity_prefixes = ['@','#']
    for separator in  string.punctuation:
        if separator not in entity_prefixes :
            text = text.replace(separator,' ')
    words = []
    for word in text.split():
        word = word.strip()
        if word:
            if word[0] not in entity_prefixes:
                words.append(word)
    return ' '.join(words)


tests = [
    "@peter I really love that shirt at #Macy. http://bet.ly//WjdiW4",
    "@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bet.ly/tuN2wx",
    "I am at Starbucks http://4sh.com/samqUI (7419 3rd ave, at 75th, Brooklyn)",
]
for t in tests:
    print(strip_all_entities(strip_links(t)))

# Output (hashtags are dropped whole, so "Macy" disappears):
# I really love that shirt at
# Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve
# I am at Starbucks 7419 3rd ave at 75th Brooklyn
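
The original question keeps the hashtag word ("Macy") while the code above discards it. A compact variant with a switch for that behaviour might look like the sketch below; the function name `strip_entities` and the `keep_hashtag_words` parameter are assumptions introduced here.

```python
import string

ENTITY_PREFIXES = ('@', '#')


def strip_entities(text, keep_hashtag_words=False):
    # Turn every punctuation mark except '@' and '#' into a space, so that
    # '#a,#b' splits into two separate tokens.
    for mark in string.punctuation:
        if mark not in ENTITY_PREFIXES:
            text = text.replace(mark, ' ')
    words = []
    for word in text.split():
        if word.startswith('@'):
            continue  # always drop mentions
        if word.startswith('#'):
            if not keep_hashtag_words:
                continue  # drop the hashtag entirely
            word = word.lstrip('#')  # keep the word, drop the marker
        if word:
            words.append(word)
    return ' '.join(words)


print(strip_entities('#hashtag1,#hashtag2 nice day'))
print(strip_entities('I really love that shirt at #Macy.', keep_hashtag_words=True))
```

As with the original, links should be removed first, because replacing punctuation with spaces would otherwise break a URL into separate words.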


傲

Reply #5 · 2019-03-08 09:41

The following example is a close approximation. Unfortunately, there is no right way to do this with a regular expression alone. The regex below strips URLs (any scheme, not just http), punctuation, user names, and any other non-alphanumeric characters, and the surrounding split/join separates the remaining words with single spaces. To parse tweets exactly as you intend, you would need more intelligence in the system, such as some kind of self-learning algorithm, because there is no standard tweet format.

Here is what I am proposing.

' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())

And here are the results on your examples:

>>> x="@peter I really love that shirt at #Macy. http://bit.ly//WjdiW4"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I really love that shirt at Macy'
>>> x="@shawn Titanic tragedy could have been prevented Economic Times: Telegraph.co.ukTitanic tragedy could have been preve... http://bit.ly/tuN2wx"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'Titanic tragedy could have been prevented Economic Times Telegraph co ukTitanic tragedy could have been preve'
>>> x="I am at Starbucks http://4sq.com/samqUI (7419 3rd ave, at 75th, Brooklyn) "
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I am at Starbucks 7419 3rd ave at 75th Brooklyn'
>>>

And here are a few examples where it is not perfect:

>>> x="I c RT @iamFink: @SamanthaSpice that's my excited face and my regular face. The expression never changes."
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'I c RT that s my excited face and my regular face The expression never changes'
>>> x="RT @AstrologyForYou: #Gemini recharges through regular contact with people of like mind, and social involvement that allows expression of their ideas"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT Gemini recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> # Though after you add # to the regex expression filter, results become a bit better
>>> ' '.join(re.sub("([@#][A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'RT recharges through regular contact with people of like mind and social involvement that allows expression of their ideas'
>>> x="New comment by diego.bosca: Re: Re: wrong regular expression? http://t.co/4KOb94ua"
>>> ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split())
'New comment by diego bosca Re Re wrong regular expression'
>>> #See how miserably it performed?
>>>
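
Some of the rough edges above can be reduced by spelling the pattern out with `re.VERBOSE`. The sketch below assumes that the leftover "RT" marker and user names containing underscores are the cases worth handling; the names `CLEAN_RE` and `clean` are made up for illustration.

```python
import re

CLEAN_RE = re.compile(r"""
    \bRT\b             # retweet marker
  | @\w+               # @mention, underscores included
  | \w+://\S+          # URL with any scheme
  | [^0-9A-Za-z \t]    # any other non-alphanumeric character
""", re.VERBOSE)


def clean(tweet):
    # Replace every match with a space, then collapse runs of whitespace.
    return " ".join(CLEAN_RE.sub(" ", tweet).split())


print(clean("RT @Astrology_ForYou: #Gemini recharges through regular contact"))
print(clean("New comment http://t.co/4KOb94ua"))
```

Hashtag words are kept here because only the "#" character is stripped. Contractions such as "that's" are still split into "that s", so this remains an approximation.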
