I know this is not an uncommon problem and that there are already multiple SO questions answered about this (1, 2, 3) but even in following the recommendations there, I am still seeing this error (for the below code):
uri_name = u"%s_%s" % (name[1].encode('utf-8').strip(), name[0].encode('utf-8').strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
So I am trying to build a URL from a list of artist names, many of which have accents and other European characters, like so (with their names also printed via repr to show the special characters):
Auberjonois, René -> Auberjonois, Ren\xc3\xa9
Bäumer, Eduard -> B\xc3\xa4umer, Eduard
Baur-Nütten, Gisela -> Baur-N\xc3\xbctten, Gisela
Bösken, Lorenz -> B\xc3\xb6sken, Lorenz
Čapek, Josef -> \xc4\x8capek, Josef
Großmann, Rudolf -> Gro\xc3\x9fmann, Rudolf
The block I am trying to run is:
def create_uri(artist_name):
    artist_name = artist_name
    name = artist_name.split(",")
    uri_name = u"%s_%s" % (name[1].encode('utf-8').strip(), name[0].encode('utf-8').strip())
    uri = 'http://example.com/' + uri_name
    print uri

create_uri('Name, Non_Accent')
create_uri('Auberjonois, René')
So the first one works and produces http://example.com/Non_Accent_Name
But the second fails with the error above.
I have added # coding=utf-8 to the top of my script and have tried encoding the artist_name string at every point along the way, only to get the same error each time.
If it matters, I am using Atom as a text editor, and when I open the .csv file these names come from, the accents all display correctly.
What else can I do to ensure that the script interprets UTF-8 as UTF-8 and not ascii?
Judging by the print statement, you are using Python 2.x. That means you should define unicode characters via \u notation or use a u prefix for the string. So, just change your line to pass a u-prefixed string.
Also, it looks like you don't need .encode for the parts after splitting: they are already unicode.
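A minimal sketch of that suggestion, assuming the goal is simply to pass a u-prefixed string and drop the .encode calls. In Python 2, calling .encode on a byte str first decodes it implicitly with the ascii codec, which is where the UnicodeDecodeError comes from. The version below returns the URI instead of printing it and splits on the first comma only; both are choices made for this illustration, not part of the original code.

```python
# -*- coding: utf-8 -*-
# Sketch only: same URL pattern as in the question, but working on unicode throughout.


def create_uri(artist_name):
    # artist_name is expected to be a unicode string, e.g. u'Auberjonois, René'.
    # Splitting a unicode string yields unicode parts, so no .encode is needed here.
    last, first = [part.strip() for part in artist_name.split(u",", 1)]
    return u"http://example.com/%s_%s" % (first, last)


# Note the u prefix on the literal: this is the actual fix.
uri = create_uri(u"Auberjonois, René")
plain = create_uri(u"Name, Non_Accent")
```

If the result has to go somewhere that expects bytes, encoding it once at that point with uri.encode('utf-8') should be enough.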
Stop using UTF-8. Use unicodes everywhere, and only decode/encode (if necessary) at interfaces.
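As a rough sketch of what "decode/encode at interfaces" could look like here, assuming the names arrive as UTF-8 bytes (as the repr output in the question suggests). The helper name build_uri and the sample value are hypothetical and only serve the illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: decode once on the way in, work on unicode, encode once on the way out.


def build_uri(artist_name):
    # Hypothetical helper; expects unicode and returns unicode.
    last, first = [part.strip() for part in artist_name.split(u",", 1)]
    return u"http://example.com/%s_%s" % (first, last)


raw = b"Auberjonois, Ren\xc3\xa9"  # bytes as they might arrive from the .csv file
text = raw.decode("utf-8")         # input interface: bytes -> unicode
uri = build_uri(text)              # all internal work happens on unicode
encoded = uri.encode("utf-8")      # output interface: unicode -> bytes
```

Keeping the two conversions at the edges means nothing in between has to care which codec the file used.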