How to get rid of special characters while extracting

Posted 2019-05-29 09:39

I am extracting data from a website, and one of its entries contains a special character, i.e. Comfort Inn And Suites�? Blazing Stump. When I try to extract it, it throws an error:

    Traceback (most recent call last):
      File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "C:\Python27\lib\site-packages\scrapy\utils\defer.py", line 96, in iter_errback
        yield it.next()
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\offsite.py", line 24, in process_spider_output
        for x in result:
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\referer.py", line 14, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\urllength.py", line 32, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "C:\Python27\lib\site-packages\scrapy\contrib\spidermiddleware\depth.py", line 48, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "E:\Scrapy projects\emedia\emedia\spiders\test_spider.py", line 46, in parse
        print repr(business.select('a[@class="name"]/text()').extract()[0])
      File "C:\Python27\lib\site-packages\scrapy\selector\lxmlsel.py", line 51, in select
        result = self.xpathev(xpath)
      File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src\lxml\lxml.etree.c:145954)
      File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src\lxml\lxml.etree.c:144987)
      File "extensions.pxi", line 621, in lxml.etree._unwrapXPathObject (src\lxml\lxml.etree.c:139973)
      File "extensions.pxi", line 655, in lxml.etree._createNodeSetResult (src\lxml\lxml.etree.c:140328)
      File "extensions.pxi", line 676, in lxml.etree._unpackNodeSetEntry (src\lxml\lxml.etree.c:140524)
      File "extensions.pxi", line 784, in lxml.etree._buildElementStringResult (src\lxml\lxml.etree.c:141695)
      File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src\lxml\lxml.etree.c:26255)
    exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 22: invalid continuation byte

I have tried a lot of different things found on the web, such as decode('utf-8') and unicodedata.normalize('NFC', business.select('a[@class="name"]/text()').extract()[0]), but the problem persists.

The source URL is "http://www.truelocal.com.au/find/hotels/97/", and the entry I am talking about is the fourth one on that page.

Tags: python scrapy
2 answers
甜甜的少女心
#2 · 2019-05-29 10:06

Don't use "replace" to fix Mojibake; fix the database and the code that caused the Mojibake.

But first you need to determine whether it is simply Mojibake or "double encoding". With a `SELECT col, HEX(col) ...` query, determine whether a single character turned into 2-4 bytes (Mojibake) or 4-6 bytes (double encoding). Examples:

`é` (as utf8) should come back `C3A9`, but instead shows `C383C2A9`
The Emoji `
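To see how double encoding produces bytes like `C383C2A9`, here is a minimal Python 2 sketch (my own illustration, matching the question's Python 2.7 environment; it is not part of the original answer):

    good = u'\u00e9'.encode('utf-8')                 # U+00E9 e-acute as UTF-8: C3 A9
    double = good.decode('latin-1').encode('utf-8')  # mis-read as Latin-1, re-encoded as UTF-8
    print good.encode('hex').upper()                 # C3A9
    print double.encode('hex').upper()               # C383C2A9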
Viruses.
#3 · 2019-05-29 10:11

You have a bad Mojibake in the original webpage, probably due to bad Unicode handling somewhere during data entry. The actual bytes in the page source are C3 3F C2 A0 when expressed in hexadecimal.

I think it was once a U+00A0 NO-BREAK SPACE. Encoded to UTF-8 that becomes C2 A0; interpret those bytes as Latin-1 instead and encode to UTF-8 again, and you get C3 82 C2 A0. But 82 is a control character when interpreted as Latin-1 once more, so it was substituted by a ? question mark, hex 3F when encoded.
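A small Python 2 sketch (my own, not from the page) reproducing that chain; the cp1252 'replace' step is an assumption standing in for whatever tool substituted the question mark:

    step1 = u'\u00a0'.encode('utf-8')                 # NO-BREAK SPACE as UTF-8: C2 A0
    step2 = step1.decode('latin-1').encode('utf-8')   # mis-read as Latin-1, re-encoded: C3 82 C2 A0
    step3 = step2.decode('latin-1').encode('cp1252', 'replace')  # U+0082 has no cp1252 mapping -> '?'
    print step3.encode('hex').upper()                 # C33FC2A0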

When you follow the link to the detail page for that venue, you get a different Mojibake for the same name: Comfort Inn And SuitesÃ‚Â Blazing Stump, giving us the Unicode characters U+00C3, U+201A, U+00C2 and a &nbsp; HTML entity, or Unicode character U+00A0 again. Encode that as Windows Codepage 1252 (a superset of Latin-1) and you get C3 82 C2 A0 again.
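You can verify that claim in a Python 2 session (my own check, not from the original answer):

    name = u'\u00c3\u201a\u00c2\u00a0'                 # Ã, ‚, Â, plus the NO-BREAK SPACE
    print name.encode('cp1252').encode('hex').upper()  # C382C2A0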

You can only get rid of it by targeting these bytes directly in the source of the page:

    pagesource = pagesource.replace('\xc3?\xc2\xa0', '\xc2\xa0')

This 'repairs' the data by replacing the train wreck with the originally intended UTF-8 bytes.

If you have a scrapy Response object, replace the body:

    body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
    response = response.replace(body=body)
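For context, here is a sketch of how that repair could sit in the spider's parse callback. The HtmlXPathSelector usage and the inner XPath come from the question's traceback; the spider wiring and the container XPath are hypothetical:

    from scrapy.selector import HtmlXPathSelector

    def parse(self, response):
        # Repair the Mojibake before lxml ever sees the bad bytes.
        body = response.body.replace('\xc3?\xc2\xa0', '\xc2\xa0')
        response = response.replace(body=body)
        hxs = HtmlXPathSelector(response)
        for business in hxs.select('//div[@class="listing"]'):  # hypothetical container XPath
            print repr(business.select('a[@class="name"]/text()').extract()[0])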