Python Character Encoding European Accents

Posted 2019-08-10 16:03

Question:

I know this is a common problem and that several SO questions already answer it (1, 2, 3), but even after following the recommendations there I still get this error for the code below:

uri_name = u"%s_%s" % (name[1].encode('utf-8').strip(), name[0].encode('utf-8').strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

I am trying to build a URL from a list of artist names, many of which contain accents and other European characters. Some examples, each followed by its repr:

Auberjonois, René -> Auberjonois, Ren\xc3\xa9
Bäumer, Eduard -> B\xc3\xa4umer, Eduard
Baur-Nütten, Gisela -> Baur-N\xc3\xbctten, Gisela
Bösken, Lorenz -> B\xc3\xb6sken, Lorenz
Čapek, Josef -> \xc4\x8capek, Josef
Großmann, Rudolf -> Gro\xc3\x9fmann, Rudolf

The block I am trying to run is:

def create_uri(artist_name):
  artist_name = artist_name
  name = artist_name.split(",")
  uri_name = u"%s_%s" % (name[1].encode('utf-8').strip(), name[0].encode('utf-8').strip())
  uri = 'http://example.com/' + uri_name
  print uri

create_uri('Name, Non_Accent')
create_uri('Auberjonois, René')

The first call works and prints http://example.com/Non_Accent_Name, but the second fails with the error above.

I have added # coding=utf-8 to the top of my script and have tried encoding the artist_name string at every point along the way, only to get the same error each time.

If it matters, I am using Atom as a text editor, and when I open the .csv file these names come from, the accents all display correctly.

What else can I do to ensure that the script interprets UTF-8 as UTF-8 and not ASCII?

Answer 1:

Stop passing UTF-8 byte strings around. Use unicode objects everywhere inside the program, and only decode/encode (if necessary) at the interfaces, i.e. when reading input or writing output.

def create_uri(artist_name):
  name = artist_name.split(u",")
  uri_name = u"%s_%s" % (name[1].strip(), name[0].strip())
  uri = u'http://example.com/' + uri_name
  print uri

create_uri(u'Name, Non_Accent')
create_uri(u'Auberjonois, René')
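
To make "decode/encode at interfaces" a bit more concrete, here is a minimal sketch that should run on both Python 2.7 and Python 3. It assumes the names arrive as UTF-8 bytes (for example, read from the .csv file). The variables raw, text and encoded are only illustrative, and this create_uri returns the URI instead of printing it so the result can be encoded afterwards.

```python
from __future__ import print_function, unicode_literals


def create_uri(artist_name):
    # artist_name is expected to be text (unicode), not bytes.
    name = artist_name.split(",")
    uri_name = "%s_%s" % (name[1].strip(), name[0].strip())
    return "http://example.com/" + uri_name


# Hypothetical raw input: UTF-8 bytes, as they might come out of the .csv file.
raw = b"Auberjonois, Ren\xc3\xa9"

# Decode once, at the input boundary.
text = raw.decode("utf-8")

# Everything in between works on unicode only.
uri = create_uri(text)

# Encode once, at the output boundary (e.g. before writing to a file or socket).
encoded = uri.encode("utf-8")
print(repr(encoded))
```

Between the decode and the encode there is no byte string left in the program, so Python never has a reason to fall back on an implicit ASCII conversion.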


Answer 2:

Judging by the print statement, you are using Python 2.x. There, you should either write non-ASCII characters with \u escapes or give the string literal a u prefix. So just change your call to:

create_uri(u'Auberjonois, René') # note the u''

Also, once the input is a unicode string, you don't need .encode on the parts after splitting: they are already unicode.
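
As for why the original code fails: in Python 2, calling .encode('utf-8') on a plain (byte) string first decodes it implicitly with the default ASCII codec, and that hidden step is what raises the error. The sketch below makes the step explicit so it runs the same way on Python 2.7 and Python 3. The variable raw is an assumed stand-in for name[1], i.e. the part after the comma as UTF-8 bytes.

```python
from __future__ import print_function

# Assumed stand-in for name[1]: " René" as UTF-8 bytes, leading space included.
raw = b" Ren\xc3\xa9"

try:
    # The implicit step Python 2 performs before it can encode a byte string.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the codec the bytes were actually written in works fine.
decoded = raw.decode("utf-8")
print(repr(decoded))
```

The byte 0xc3 at position 4 is the first byte of the two-byte UTF-8 sequence for é, which matches the position reported in the question's traceback.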