I have a Python script that works great on my local machine (OS X), but when I copied it to a server (Debian), it does not work as expected. The script reads an XML file and prints the contents in a new format. On my local machine, I can run the script with stdout going to the terminal or to a file (i.e. > myFile.txt), and both work fine.
However, on the server (ssh), when I print to the terminal everything works fine, but printing to the file (which is what I really need) gives UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128). All files are in utf-8 encoding, and utf-8 is declared in the magic comment.
If I print the str objects inside a list (which is a trick I usually use to get a handle on encoding issues), it also throws the same error.
If I use print( x.encode('utf-8') ), then it prints code-style bits (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').
If I run export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), then I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
I have checked all of the locale variables and the relevant ones match what I have on my local machine.
I can simply process the file locally and upload it, but I really want to understand what is happening here. Since the python code is working on one computer, I am not sure that it is relevant, but I am adding it below:
# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET
corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
    for sent in body :
        depDOMs = [(0,'') for i in range(len(sent)+1)]
        for word in sent :
            if word.tag == 'LF' :
                pass
            elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
                ID = word.attrib['ID']
                try :
                    Form = word.text.replace(' ','_')
                except AttributeError :
                    Form = '_'
                try :
                    Lemma = word.attrib['LEMMA'].replace(' ', '_')
                except KeyError :
                    Lemma = '*NULL*'
                CPOS = word.attrib['FEAT'].split()[0]
                POS = word.attrib['FEAT'].replace( ' ' , '_' )
                Feats = '_'
                Head = word.attrib['DOM']
                if Head == '_root' :
                    Head = '0'
                try :
                    DepRel = word.attrib['LINK']
                except KeyError :
                    DepRel = 'ROOT'
                PHead = '_'
                PDepRel = '_'
                try:
                    if word.attrib['NODETYPE'] == 'FANTOM' :
                        word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
                except KeyError :
                    pass
                print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
            else :
                print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
        print()
The underlying issue may be caused by a misconfiguration of Linux's locales, which makes Python overly cautious when printing non-ASCII characters.
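To illustrate the mechanism, here is a minimal sketch that writes the same text through a text stream configured with two different encodings. It assumes the redirected stdout on the server behaves like a stream that fell back to ASCII; the helper name write_with_encoding is made up for this example, and the sample word is the one whose bytes appear in the question.

```python
import io

# "Кама", the word whose UTF-8 bytes are shown in the question
WORD = "\u041a\u0430\u043c\u0430"


def write_with_encoding(text, encoding):
    """Write `text` through a text stream using `encoding`; return the raw bytes.

    Hypothetical helper: it stands in for a stdout whose encoding was
    chosen from the locale.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)   # encoding happens here and may raise
    stream.flush()
    return raw.getvalue()


# A UTF-8 stream accepts the text and produces the expected bytes.
utf8_bytes = write_with_encoding(WORD, "utf-8")

# An ASCII stream rejects it with the same error as in the question.
try:
    write_with_encoding(WORD, "ascii")
except UnicodeEncodeError as err:
    ascii_error = str(err)
```

If the server's stream really is ASCII, nothing in the script itself needs to be wrong for the error to appear; only the stream's encoding differs between the two machines.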
Confirm the locale configuration with locale. If there's a problem, you'll see something like:
Fix this with:
(replace "en_US.UTF-8" with the locale that's not working). For further info, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue
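It can also help to ask Python directly which encoding it picked, since that is what decides whether printing succeeds. The following is a rough diagnostic sketch, not part of the original answer; the function name io_encoding_report is invented here. It writes to stderr so the report stays visible when stdout is redirected to a file.

```python
import locale
import os
import sys


def io_encoding_report():
    """Collect the settings that influence how print() encodes text.

    Hypothetical helper for diagnosis only.
    """
    report = {
        # Encoding of the current stdout stream (may differ for tty vs file)
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding Python derives from the locale settings
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }
    # Environment variables that commonly affect the choice
    for name in ("LC_ALL", "LC_CTYPE", "LANG", "PYTHONIOENCODING"):
        report[name] = os.environ.get(name)
    return report


if __name__ == "__main__":
    for key, value in io_encoding_report().items():
        print(f"{key} = {value!r}", file=sys.stderr)
```

Running it twice on the server, once normally and once with > myFile.txt, and comparing the two reports should show whether the stdout encoding changes when output is redirected.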
You can find important information related to the error you are experiencing in the attributes of the UnicodeError-based exception.
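As a small sketch of what those attributes look like in practice, the snippet below triggers a UnicodeEncodeError on purpose and collects its fields. The attributes encoding, reason, object, start and end are the documented ones on UnicodeError; the helper describe_unicode_error and the sample string are made up for illustration.

```python
def describe_unicode_error(err):
    """Return the documented UnicodeError attributes as a dict (hypothetical helper)."""
    return {
        "encoding": err.encoding,   # codec that failed
        "reason": err.reason,       # human-readable cause
        "start": err.start,         # first bad index in err.object
        "end": err.end,             # index just past the last bad one
        "offending": err.object[err.start:err.end],
    }


info = None
try:
    # A tab-separated line similar to what the script prints
    "1\t\u041a\u0430\u043c\u0430".encode("ascii")
except UnicodeEncodeError as err:
    info = describe_unicode_error(err)
    # ascii() keeps the report printable even on an ASCII-only stream
    print(ascii(info))
```

Wrapping the failing print call in the script the same way would show which codec Python was actually using and which characters it rejected.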
Quoting the documentation: