Python script receiving a UnicodeEncodeError: '

Posted 2019-03-04 07:59

Question:

I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it began having issues, which I assume are caused by a formatting problem in someone's reddit title. The error that I'm receiving is:

  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

And here is my script:

# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'


def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]

        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit



def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()

Any help would be very much appreciated - thanks in advance!

Answer 1:

Consider this simple program:

print(u'\u201c' + "python")

If you try printing to a terminal (with an appropriate character encoding), you get

“python

However, if you try redirecting output to a file, you get a UnicodeEncodeError.

script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

When you print to a terminal, Python uses the terminal's character encoding to encode unicode. (Terminals can only print bytes, so unicode must be encoded in order to be printed.)

When you redirect output to a file, Python cannot determine the character encoding, since files have no declared encoding. So by default Python2 implicitly encodes all unicode using the ascii encoding before writing to the file. Since u'\u201c' cannot be ascii encoded, a UnicodeEncodeError is raised. (Only the first 128 unicode code points, 0 through 127, can be encoded with ascii.)
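As a rough illustration of the terminal-versus-file difference, the sketch below checks whether a piece of text could be printed to a given stream. It assumes a stream that declares no encoding is treated as ascii, as described for Python2. The names declared_or_ascii, can_print and FakeRedirectedStream are made up for this example and are not part of any library.

```python
import sys


def declared_or_ascii(stream):
    """Return the encoding the stream declares, or 'ascii' if it declares none.

    Falling back to 'ascii' mirrors the Python2 default for redirected output.
    """
    return getattr(stream, 'encoding', None) or 'ascii'


class FakeRedirectedStream(object):
    """Hypothetical stand-in for a stdout that has been redirected to a file."""
    encoding = None


def can_print(text, stream):
    """Report whether text can be encoded with the stream's effective encoding."""
    try:
        text.encode(declared_or_ascii(stream))
        return True
    except UnicodeEncodeError:
        return False


# Plain ASCII text encodes either way; the curly quote does not encode as ascii.
plain_ok = can_print(u'python', FakeRedirectedStream())
quote_ok = can_print(u'\u201cpython', FakeRedirectedStream())
```

Calling declared_or_ascii(sys.stdout) in the terminal and again with output redirected should show which case a script is in. The exact value depends on the Python version and environment.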

This issue is explained in detail in the Why Print Fails wiki.


To fix the problem, first, avoid adding unicode and byte strings. This causes implicit conversion using the ascii codec in Python2, and an exception in Python3. To future-proof your code, it is better to be explicit. For example, build the message as unicode (looking up post_dict with the original unicode key), then encode it explicitly before printing the bytes:

message = u'{} {} #python'.format(post, post_dict[post])
print(message.encode('utf-8'))
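Another way to be explicit, which this answer does not mention, is to wrap the output stream in a writer that encodes everything passed to it. The sketch below uses codecs.getwriter from the standard library. The helper name write_utf8 is made up for the example, and an in-memory io.BytesIO stands in for a redirected stdout.

```python
import codecs
import io


def write_utf8(byte_stream, text):
    """Encode unicode text as UTF-8 while writing it to a byte stream."""
    writer = codecs.getwriter('utf-8')(byte_stream)
    writer.write(text)


# io.BytesIO is only a stand-in for a byte-oriented output such as a file.
buffer = io.BytesIO()
write_utf8(buffer, u'\u201cpython\u201d #python\n')
written = buffer.getvalue()
```

In Python2 the same writer is commonly wrapped around sys.stdout itself. Whether that suits this bot is a judgment call, since it changes the behavior of every later print.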


Answer 2:

You are trying to print a unicode string to your terminal (or possibly to a file via IO redirection), but the encoding used for that output is ASCII. Because of this, Python attempts to convert the string from its unicode representation to ASCII, but fails because code point u'\u201c' (“) cannot be represented in ASCII. Effectively your code is doing this:

>>> print u'\u201c'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

You could try converting to UTF-8:

print (post + " " + post_dict[post] + " #python").encode('utf8')

or convert to ASCII like this:

print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace')

which will replace every character that cannot be encoded as ASCII with ?.
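For comparison, 'replace' is only one of the standard error handlers that encode accepts. The sketch below runs a sample title through several of them so the trade-offs are visible. The sample title and the variable names are made up for the example.

```python
# Sample title with curly quotes (U+201C and U+201D); illustrative only.
title = u'\u201cHi\u201d'

handlers = ('replace', 'ignore', 'xmlcharrefreplace', 'backslashreplace')

# Map each error handler name to the ASCII bytes it produces for the title.
results = {}
for handler in handlers:
    results[handler] = title.encode('ascii', handler)

for handler in handlers:
    print('%s -> %r' % (handler, results[handler]))
```

All of these avoid the exception, but the first two lose information. Encoding to UTF-8 keeps the original characters.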

Another way, which is useful if you are printing for debugging purposes, is to print the repr of the string:

print repr(post + " " + post_dict[post] + " #python")

which would output something like this:

>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s)
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'


Answer 3:

The problem likely arises from mixing bytestrings and unicode strings on concatenation. As an alternative to prefixing all string literals with u, adding

from __future__ import unicode_literals

at the top of the script may fix things for you. See here for a deeper explanation and to decide whether it's an option for you or not.