UnicodeEncodeError when writing to file

Posted 2019-04-14 06:45

I have a python script that works great on my local machine (OS X), but when I copied it to a server (Debian), it does not work as expected. The script reads an xml file and prints the contents in a new format. On my local machine, I can run the script with stdout to the terminal or to a file (i.e. > myFile.txt), and both work fine.

However, on the server (over ssh), printing to the terminal works fine, but printing to a file (which is what I really need) raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are in utf-8 encoding, and utf-8 is declared in the magic comment.

If I print the str objects inside a list (which is a trick I usually use to get a handle on encoding issues), it also throws the same error.

If I use print( x.encode('utf-8') ), then it prints code-style bits (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').

If I $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
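For reference, the bytes shown above are well-formed UTF-8 rather than corrupted data. The following minimal sketch decodes the exact byte string from the print( x.encode('utf-8') ) output; the <D0><9A> notation is presumably how the viewer on the server renders bytes it cannot display in its current locale (an assumption, since the viewer is not named).

```python
# Bytes copied from the print(x.encode('utf-8')) output quoted above.
raw = b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0'

# Decoding as UTF-8 succeeds, so the bytes are valid text, not binary junk.
word = raw.decode('utf-8')

# Report the code points instead of the characters themselves, so this
# snippet also runs where stdout can only encode ASCII.
code_points = [hex(ord(ch)) for ch in word]
print(len(word), code_points)
```

The eight bytes decode to four Cyrillic characters, which suggests the file written with PYTHONIOENCODING=utf-8 may already contain the intended text.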

I have checked all of the locale variables and the relevant ones match what I have on my local machine.

I can simply process the file locally and upload it, but I really want to understand what is happening here. Since the python code is working on one computer, I am not sure that it is relevant, but I am adding it below:

# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET

corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
  for sent in body :
    depDOMs = [(0,'') for i in range(len(sent)+1)]
    for word in sent :
      if word.tag == 'LF' :
        pass
      elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
        ID = word.attrib['ID']
        try :
          Form =  word.text.replace(' ','_')
        except AttributeError :
          Form = '_'
        try :
          Lemma =  word.attrib['LEMMA'].replace(' ', '_')
        except KeyError :
          Lemma = '*NULL*'
        CPOS = word.attrib['FEAT'].split()[0]
        POS = word.attrib['FEAT'].replace( ' ' , '_' )
        Feats = '_'
        Head = word.attrib['DOM']
        if Head == '_root' :
          Head = '0'
        try :
          DepRel = word.attrib['LINK']
        except KeyError :
          DepRel = 'ROOT'
        PHead = '_'
        PDepRel = '_'
        try:
          if word.attrib['NODETYPE'] == 'FANTOM' :
            word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
        except KeyError :
          pass
        print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
      else :
        print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
  print()

2 answers
啃猪蹄的小仙女
#2 · 2019-04-14 06:49

The underlying issue may be a misconfiguration of the Linux locales, which makes Python overly cautious when printing non-ASCII characters.

Check the locale configuration by running locale. If there's a problem, you'll see something like:

$ locale 
locale: Cannot set LC_CTYPE to default locale: No such file or directory 
locale: Cannot set LC_ALL to default locale: No such file or directory 
LANG=en_US.UTF-8 
LANGUAGE= 

Fix this with:

$ sudo locale-gen "en_US.UTF-8"

(replace "en_US.UTF-8" with the locale that's not working). For further info, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue
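To see from inside Python which encoding was picked, and to sidestep the locale entirely, something along these lines may help. This is only a sketch: utf8_writer is a hypothetical helper name, and the demonstration writes to an in-memory buffer rather than the real stdout.

```python
import io
import locale
import sys

# The encoding Python selected for stdout, and the locale's preferred one.
# Under a broken or C/POSIX locale, older Python 3 versions may report an
# ASCII codec here once output is redirected to a file.
print(getattr(sys.stdout, 'encoding', None))
print(locale.getpreferredencoding(False))


def utf8_writer(binary_stream):
    """Wrap a binary stream so text written to it is always UTF-8 encoded."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', newline='\n')


# Demonstration against an in-memory buffer instead of the real stdout.
buffer = io.BytesIO()
out = utf8_writer(buffer)
print('1', '\u041a\u0430\u043c\u0430', sep='\t', file=out)
out.flush()
encoded = buffer.getvalue()
```

In the actual script, the same idea would be sys.stdout = utf8_writer(sys.stdout.buffer) near the top, or sys.stdout.reconfigure(encoding='utf-8') on Python 3.7 and later. Either makes the output independent of the server's locale settings, though fixing the locale is still the cleaner long-term remedy.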

再贱就再见
#3 · 2019-04-14 06:52

You can find important information about the error you are experiencing in the attributes of the UnicodeError-based exception.

Quoting the documentation:

UnicodeError has attributes that describe the encoding or decoding error. For example, err.object[err.start:err.end] gives the particular invalid input that the codec failed on.

encoding

The name of the encoding that raised the error.

reason

A string describing the specific codec error.

object

The object the codec was attempting to encode or decode.

start

The first index of invalid data in object.

end

The index after the last invalid data in object.
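As a rough illustration of the attributes listed above, the sketch below forces the same kind of failure by encoding a sample string as ASCII and then inspects the exception. The sample string is made up for the example; in the real script the exception would come from print().

```python
# A sample line with an ASCII prefix followed by four Cyrillic characters.
sample = '1\t\u041a\u0430\u043c\u0430'

try:
    sample.encode('ascii')
except UnicodeEncodeError as err:
    # Slice out exactly the characters the codec could not handle.
    bad = err.object[err.start:err.end]
    info = (err.encoding, err.reason, err.start, err.end)
    # ascii() keeps this diagnostic printable even on an ASCII-only stdout.
    print(info, ascii(bad))
```

Wrapping the failing print() call in a similar try/except would show which encoding Python is actually using for the redirected output and which characters trigger the error.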
