I'm writing a script that goes to a list of links and parses the information.
It works for most sites, but it's choking on some with "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)"
It stops in client.py, which is part of urllib on Python 3.
The exact link is: http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html
There are quite a few similar postings here but none of the answers seems to work for me.
My code is:
import socket
from urllib import request
from urllib.error import HTTPError, URLError
def __request(link, debug=0):
    try:
        # Long timeout because I was getting lots of timeouts.
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')
    # NOTE: except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print('The server couldn\'t fulfill the request for ' + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        return ''
    else:
        return unicode_html
This calls the request function:
link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
page = __request(link)
And the traceback is:
Traceback (most recent call last):
File "<string>", line 250, in run_nodebug
File "C:\reader\get_news.py", line 276, in <module>
main()
File "C:\reader\get_news.py", line 255, in main
body = get_article_body(item['link'],debug=0)
File "C:\reader\get_news.py", line 155, in get_article_body
page = __request('na',url)
File "C:\reader\get_news.py", line 50, in __request
html = request.urlopen(link, timeout=35).read()
File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
return opener.open(url, data, timeout)
File "C:\Python33\Lib\urllib\request.py", line 469, in open
response = self._open(req, data)
File "C:\Python33\Lib\urllib\request.py", line 487, in _open
'_open', req)
File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
result = func(*args)
File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "C:\Python33\Lib\http\client.py", line 1061, in request
self._send_request(method, url, body, headers)
File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
self.putrequest(method, url, **skips)
File "C:\Python33\Lib\http\client.py", line 953, in putrequest
self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)
Any help is appreciated; it's driving me crazy. I think I've tried all combinations of x.decode and similar.
(I could ignore the offending characters, if that is possible.)
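Use a percent-encoded URL:
link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'
I found the above percent-encoded URL by pointing the browser at
http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html
going to the page, then copying and pasting the encoded URL supplied by the browser back into the text editor. However, you can also generate a percent-encoded URL programmatically, for example by splitting the URL with urllib.parse.urlsplit, percent-encoding the path with urllib.parse.quote, and reassembling it with urllib.parse.urlunsplit, which yields the same percent-encoded URL.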
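The programmatic route can be sketched roughly as follows. This is a minimal sketch, assuming only the path contains non-ASCII characters; the helper name percent_encode_url is hypothetical, and keeping '%' in the safe set is an assumption made so that URLs which are already escaped are not encoded a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_url(url):
    """Return url with non-ASCII path characters percent-encoded as UTF-8.

    Hypothetical helper for illustration; it only touches the path component.
    """
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Keep '/' and '%' as-is so path separators and existing escapes survive.
    path = quote(path, safe="/%")
    return urlunsplit((scheme, netloc, path, query, fragment))


link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
encoded = percent_encode_url(link)
print(encoded)
# http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
```

The result contains only ASCII characters, so it can be passed to request.urlopen without triggering the UnicodeEncodeError in http.client.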
Your URL contains characters that cannot be represented as ASCII characters.
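You'll have to ensure that all characters have been properly URL encoded; use urllib.parse.quote_plus on the offending parts of the URL, for example. It uses UTF-8 URL-encoded escaping to represent any non-ASCII characters. Apply it to individual components rather than to the whole URL, because by default it also escapes the ':' and '/' separators.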
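To illustrate how the quoting functions behave, here is a small sketch. The sample strings and the example.com host are assumptions chosen for demonstration, not taken from the question.

```python
from urllib.parse import quote, quote_plus

# A single component containing a non-ASCII character and a space.
segment = 'cafés growing'

path_style = quote(segment)        # spaces become %20, suited to path segments
query_style = quote_plus(segment)  # spaces become '+', suited to query strings
print(path_style)   # caf%C3%A9s%20growing
print(query_style)  # caf%C3%A9s+growing

# Quoting an entire URL also escapes the ':' and '/' separators,
# which is why only the individual components should be quoted.
whole = quote_plus('http://example.com/cafés')
print(whole)  # http%3A%2F%2Fexample.com%2Fcaf%C3%A9s
```

In both cases the é is written as %C3%A9, its UTF-8 bytes in percent-encoded form.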