How to eliminate the ☎ Unicode character?

Posted 2019-06-18 04:48

While web scraping, after stripping all HTML tags, I was left with the black telephone character ☎ (\u260e in Unicode). Unlike in this response, I want to get rid of it too.

I used the following regular expression in Scrapy to eliminate HTML tags:

pattern = re.compile("<.*?>|&nbsp;|&amp;",re.DOTALL|re.M)

Then I tried to match \u260e, and I think I got caught by the backslash plague. I tried these patterns without success:

pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>|&nbsp;|&amp;|\\\\u260e",re.DOTALL|re.M)

None of these worked, and I still have \u260e in the output. How can I make it disappear?

3 answers
放我归山
#2 · 2019-06-18 04:52

Using Python 2.7.3, the following works fine for me:

import re

pattern = re.compile(u"<.*?>|&nbsp;|&amp;|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)

Output:

u'bla ble  blo'

As @Zack pointed out, this works because the pattern is now a unicode string: Python resolves the escape sequence \u260e into the single character it names, the little black phone ☎ (code point U+260E), instead of leaving it as six separate characters.

Once both the string being searched and the regular expression contain the black phone itself, rather than the character sequence \u260e, the pattern matches.
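
A minimal sketch of that difference, assuming Python 3 (where every str literal is already unicode, so the u prefix is optional). The names escaped, raw and cleaned are only for illustration:

```python
import re

# In a normal (unicode) literal the parser resolves the escape
# into a single code point, U+260E.
escaped = "\u260e"
# In a raw literal the backslash is kept, leaving six separate characters.
raw = r"\u260e"

print(len(escaped), len(raw))

# Same pattern as in the answer above, written as a Python 3 literal.
pattern = re.compile("<.*?>|&nbsp;|&amp;|\u260e", re.DOTALL | re.M)
s = "<b>bla</b> ble \u260e blo"
cleaned = pattern.sub("", s)
print(repr(cleaned))
```

On Python 2, as in the answer, the same effect needs the u prefix on both the pattern and the input string.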

你好瞎i
#3 · 2019-06-18 05:14

If your string is already unicode, there are two easy ways. The second one removes every character outside string.printable, so it affects more than just the ☎.

>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum  Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'
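
To make the difference between the two approaches concrete, here is a small sketch, assuming Python 3. The sample text with an accented letter is made up for illustration:

```python
import string

# Hypothetical sample: one telephone sign and one accented letter.
foo = "Lorum \u260e Ipsum caf\u00e9"

# Targeted: removes only the telephone, keeps the accented letter.
only_phone_removed = foo.replace("\u260e", "")

# Broad: keeps only characters found in string.printable (ASCII),
# so the accented letter is dropped as well.
ascii_only = "".join(ch for ch in foo if ch in string.printable)

print(repr(only_phone_removed))
print(repr(ascii_only))
```

If the scraped pages contain any non-English text, the targeted replace is probably the safer choice.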
狗以群分
#4 · 2019-06-18 05:14

You could try BeautifulSoup, as explained here, with something like:

from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)

soup = BeautifulSoup(html.decode('utf-8', 'ignore'))
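
One caveat, shown here as a standard-library sketch assuming Python 3 (the byte string is a made-up example): the 'ignore' error handler only drops bytes that are not valid UTF-8. A correctly encoded ☎ survives the decode, so it probably still has to be removed in a separate step.

```python
# Hypothetical scraped bytes: valid UTF-8 text plus one invalid byte (0xFF).
raw_bytes = "bla \u260e blo".encode("utf-8") + b"\xff"

# 'ignore' silently drops the invalid byte but keeps the telephone.
decoded = raw_bytes.decode("utf-8", "ignore")

# Removing the telephone still needs an explicit step.
cleaned = decoded.replace("\u260e", "")

print(repr(decoded))
print(repr(cleaned))
```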