2016-01-17 82 views
0

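Python script receiving UnicodeEncodeError: 'ascii' codec can't encode character

I have a simple Python script that pulls posts from reddit and posts them on Twitter. Unfortunately, tonight it started failing, which I assume is because someone's title on reddit has a formatting issue. The error I am receiving is: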

  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

Here is my script:

# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')

access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'


def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."

def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]

        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids

def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' % (subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit



def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found

def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")

def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)

def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"

if __name__ == '__main__':
    main()

Any help would be greatly appreciated. Thanks in advance!

+1

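Would you mind fixing the indentation of your example? Also, does it help to explicitly encode post to bytes before formatting and printing? – karlson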

+0

You may find this article useful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), written by SO veteran Ned Batchelder.

Answers

1

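The problem probably stems from concatenating a mix of byte strings and unicode strings. As an alternative to prefixing every string literal with u, perhaps

from __future__ import unicode_literals

fixes things for you. See here for a more in-depth explanation and decide whether it is an option for you.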

2

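You are trying to print a unicode string to the terminal (or possibly to a file via IO redirection), but the encoding used by your terminal (or file system) is ASCII. Python therefore tries to convert the string from its unicode representation to ASCII, and fails because the code point u'\u201c' cannot be represented in ASCII. Effectively your code is doing this: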

>>> print u'\u201c'.encode('ascii') 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128) 

You can try converting to UTF-8:

print (post + " " + post_dict[post] + " #python").encode('utf8') 

or converting to ASCII like this:

print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace') 

which replaces every character that cannot be encoded in ASCII with ?.
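To compare these choices side by side, here is a minimal sketch of how the error-handler argument of `encode` changes the result. The sample title is made up for illustration, and the snippet is written so that it should run on Python 2.7 as well as Python 3.

```python
# Hypothetical title containing curly quotes (U+201C and U+201D).
title = u'\u201cHello\u201d from reddit'

# UTF-8 can represent every code point, so nothing is lost.
as_utf8 = title.encode('utf-8')

# ASCII cannot represent the quotes; the error handler decides what happens.
replaced = title.encode('ascii', 'replace')          # each quote becomes ?
ignored = title.encode('ascii', 'ignore')            # quotes are dropped
escaped = title.encode('ascii', 'backslashreplace')  # quotes become \u201c and \u201d

# The default handler is 'strict', which raises the error from the question.
try:
    title.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Which handler fits depends on the destination: UTF-8 keeps the title intact, while 'replace' and 'ignore' lose information and are mainly useful for log output.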

Another way, useful if you are printing for debugging purposes, is to print the repr of the string:

print repr(post + " " + post_dict[post] + " #python") 

This will output something like this:

>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s) 
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c' 
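
The session above is Python 2. On Python 3, `repr()` leaves printable non-ASCII characters as they are, so the equivalent debugging trick there is the built-in `ascii()`. A minimal sketch, assuming Python 3:

```python
s = 'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'

# On Python 3, repr() keeps the curly quotes, so printing the result can
# still fail on an ASCII-only stream.
shown = repr(s)

# ascii() escapes every non-ASCII character, so the result is always
# safe to print.
safe = ascii(s)
print(safe)
```
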
3

Consider this simple program:

print(u'\u201c' + "python") 

If you print to a terminal (with an appropriate character encoding), you get

“python

However, if you try to redirect the output to a file, you get a UnicodeEncodeError:

script.py > /tmp/out 
Traceback (most recent call last): 
    File "/home/unutbu/pybin/script.py", line 4, in <module> 
    print(u'\u201c' + "python") 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128) 

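When you print to a terminal, Python uses the terminal's character encoding to encode the unicode. (A terminal can only print bytes, so unicode must be encoded before it can be printed.)

When output is redirected to a file, Python cannot determine the character encoding, because files have no declared encoding. So by default, Python2 implicitly encodes all unicode with the ascii encoding before writing to the file. Since u'\u201c' cannot be encoded as ascii, a UnicodeEncodeError is raised. (Only the first 128 unicode code points, U+0000 through U+007F, can be encoded as ascii.)

This issue is explained in detail in the Why Print Fails wiki.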
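
The difference between the two cases can be reproduced without a terminal by writing to an in-memory stream with an explicit encoding. This is only a sketch: `write_text` is a hypothetical helper, and `io.TextIOWrapper` stands in for the encoding step Python applies to `sys.stdout`.

```python
import io

def write_text(text, encoding):
    """Write text through an encoding text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

title = u'\u201cpython'

# A stream that encodes as UTF-8 accepts the curly quote.
utf8_bytes = write_text(title, 'utf-8')

# A stream that encodes as ASCII fails, as in the redirected case.
try:
    write_text(title, 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

If the script's output has to be redirected, setting the `PYTHONIOENCODING` environment variable (for example to `utf-8`) before running it tells Python which encoding to use for stdout.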


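To fix this, first avoid adding unicode and byte strings together. Doing so causes an implicit conversion using the ascii codec in Python2, and an exception in Python3. To future-proof your code, it is best to be explicit.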

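# Look up the URL with the original unicode key, build the line as unicode,
# then encode once when printing.
line = u'{} {} #python'.format(post, post_dict[post])
print(line.encode('utf-8'))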
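
The same idea can be applied to the whole script: decode at the input boundary, work only with unicode inside, and encode once on output. A minimal sketch under those assumptions; `to_text` and `format_tweet` are hypothetical helpers, and the title and URL are made-up sample values.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def format_tweet(title, url):
    """Build the tweet text entirely from unicode strings."""
    return u'{} {} #python'.format(to_text(title), to_text(url))

# Sample inputs: a unicode title and a byte-string URL.
tweet = format_tweet(u'\u201cQuoted\u201d title', b'http://example.com/post')

# Encode exactly once, at the point where bytes are required.
payload = tweet.encode('utf-8')
```

Keeping every intermediate value unicode means no implicit ascii conversion can happen along the way.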