How do I unescape HTML entities in a string in Python 3.1?

I have looked all around and found only solutions for Python 2.6 and earlier, NOTHING on how to do this in Python 3.x. (I only have access to a Win7 box.)

I HAVE to be able to do this in 3.1, preferably without external libraries. Currently, I have httplib2 installed and access to command-prompt curl (that's how I'm getting the source code for pages). Unfortunately, as far as I know, curl does not decode HTML entities; I couldn't find an option for it in the documentation.

YES, I've tried to get Beautiful Soup to work in 3.x, MANY TIMES, without success. If you could provide EXPLICIT instructions on how to get it to work in Python 3 in an MS Windows environment, I would be very grateful.

So, to be clear, I need to turn strings like this: "Suzy &amp; John" into a string like this: "Suzy & John".

Tags: python html curl python-3.x entities

6 Answers

We Are One

#2 · 2020-01-23 03:56

You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library, and is portable between Python 2.x and Python 3.x.

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'


干净又极端

#3 · 2020-01-23 04:00

Python 3.x has html.entities too
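For illustration, here is a minimal sketch of looking up a named entity through `html.entities.name2codepoint`. The helper name `entity_to_char` is made up for this example and is not part of the module.

```python
from html.entities import name2codepoint

def entity_to_char(name):
    """Return the character for an entity name such as 'amp', or None if unknown.

    entity_to_char is a hypothetical helper, not a standard library function.
    """
    cp = name2codepoint.get(name)
    return chr(cp) if cp is not None else None

print(entity_to_char("amp"))   # the character behind &amp;
print(entity_to_char("quot"))  # the character behind &quot;
```

Note that the mapping is keyed by the bare name, without the leading `&` or trailing `;`.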


干净又极端

#4 · 2020-01-23 04:02

Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotations. The only thing that I found that did was this function:

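import re
from html.entities import name2codepoint as n2cp  # Python 3; in Python 2 this module is htmlentitydefs

def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # Numeric reference such as &#38; (decimal only; anything else is left as-is)
            try:
                return chr(int(ent))  # chr replaces Python 2's unichr
            except ValueError:
                return match.group()
        else:
            # Named entity such as &quot;; unknown names are left as-is
            cp = n2cp.get(ent)
            if cp:
                return chr(cp)
            else:
                return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]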

Which I got from this page.
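As a rough sketch of the same regex idea for Python 3, the variant below also accepts hexadecimal references such as `&#x26;`. The name `decode_entities` and the exact pattern are my own assumptions, not taken from the linked page.

```python
import re
from html.entities import name2codepoint

# Three alternatives: hex reference, decimal reference, named entity.
_ENTITY_RE = re.compile(
    r"&(?:#[xX]([0-9a-fA-F]{1,6})|#([0-9]{1,7})|([A-Za-z][A-Za-z0-9]{1,31}));"
)

def decode_entities(text):
    """Replace HTML entities in text; unknown or invalid ones are left untouched."""
    def substitute(match):
        hex_ref, dec_ref, name = match.groups()
        try:
            if hex_ref:
                return chr(int(hex_ref, 16))
            if dec_ref:
                return chr(int(dec_ref))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the original text.
            return match.group()
        cp = name2codepoint.get(name)
        return chr(cp) if cp is not None else match.group()
    return _ENTITY_RE.sub(substitute, text)

print(decode_entities("&quot;Suzy &amp; John&quot;"))
```

Because each match is substituted in a single pass, text produced by one replacement is never scanned again.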


爷、活的狠高调

#5 · 2020-01-23 04:04

In my case I have an HTML string escaped with the AS3 escape function. After an hour of googling I hadn't found anything useful, so I wrote this recursive function to serve my needs. Here it is:

def unescape(string):
    index = string.find("%")
    if index == -1:
        return string
    # Python 3: str has no .decode(), so the hex digits are converted with int() and chr()
    # %uXXXX is an escaped Unicode character; %XX is a two-digit hex code
    if string[index+1:index+2] == 'u':
        replace_with = chr(int(string[index+2:index+6], 16))
        rest = string[index+6:]
    else:
        replace_with = chr(int(string[index+1:index+3], 16))
        rest = string[index+3:]
    # Recurse only on the text after this escape, so a decoded "%" is not decoded again
    return string[:index] + replace_with + unescape(rest)

Edit-1 Added functionality to handle unicode characters.
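A non-recursive sketch of the same idea for Python 3, assuming the input only uses the `%XX` and `%uXXXX` forms handled above. The name `unescape_percent` is hypothetical.

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits)
_ESCAPE_RE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")

def unescape_percent(text):
    """Decode %XX and %uXXXX escapes in one pass; anything else is left as-is."""
    def substitute(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(substitute, text)

print(unescape_percent("Suzy%20%26%20John"))
```

Using `re.sub` avoids hitting the recursion limit on long strings, and a stray `%` that is not followed by hex digits is simply kept.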


SAY GOODBYE

#6 · 2020-01-23 04:08

You could use the function html.unescape:

In Python3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'

In Python3.3 or older:

import html.parser    
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
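If the same script has to run on several of these versions, one possible way to combine the three variants is a chain of import fallbacks. This is only a sketch; binding the result to a module-level name `unescape` is my own choice.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    # On older versions, use the parser's bound method as the function.
    unescape = HTMLParser().unescape

print(unescape('Suzy &amp; John'))
```

The rest of the code can then call `unescape(...)` without caring which interpreter it runs on.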


唯我独甜

#7 · 2020-01-23 04:13

I am not sure whether this is a built-in library, but it looks like what you need and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}) Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
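Since only those three entities are handled by default, the optional `entities` argument from the signature above can be used to cover a few more, such as quotation marks. A small sketch follows; the names `EXTRA_ENTITIES` and `unescape_with_quotes` are made up for illustration.

```python
from xml.sax.saxutils import unescape

# Hypothetical extra mappings; each key is the full entity text to replace.
EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}

def unescape_with_quotes(text):
    """Unescape &amp;, &lt;, &gt; plus the extra entities listed above."""
    return unescape(text, EXTRA_ENTITIES)

print(unescape_with_quotes("&quot;Suzy &amp; John&quot;"))
```

This still covers only the entities you list yourself, so it is not a general HTML entity decoder.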

