Unicode problem in a Python scraper

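I've been writing bad Perl for a while, and now I'm trying to learn to write bad Python instead. I've spent a couple of days reading around the problem I'm having (and learned a lot more about unicode as a result), but I'm still having trouble with a rogue em-dash in the following code: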

import urllib2

def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        graph_query = graph_query_root + "%22" + url + "%22"
        query_data = scrape(graph_query)
        print query_data  # debug console

### START HERE ####

graph_query_root = "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="

url_list = ['http://www.supersavvyme.co.uk', 'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']

query_graph_api(url_list)

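(This is a much-simplified version of the scraper, BTW. The original uses the site's sitemap.xml to build a list of URLs, then queries Facebook's Graph API for information on each one.)

My attempts to debug this have mostly consisted of emulating the infinite monkeys rewriting Shakespeare. My usual method (search StackOverflow for the error message, copy and paste the solution) has failed.

Question: how do I encode my data so that extended characters like the em-dash in the second URL don't break my code, but still work in the FQL query?

P.S. I'm not even sure I'm asking the right question: maybe urllib.urlencode would help me here (it would certainly make graph_query_root easier and prettier to create)...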
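
The urlencode idea from the P.S. can be sketched roughly as follows. This is an illustration under assumptions, not part of the original scraper: it assumes Python 3, where the function lives at urllib.parse.urlencode (urllib.urlencode in Python 2), and the names GRAPH_FQL_ENDPOINT, FQL_TEMPLATE and build_graph_query are hypothetical. Whether the Graph API accepts "+" for spaces is not verified here.

```python
from urllib.parse import urlencode

# Base endpoint taken from the question's graph_query_root.
GRAPH_FQL_ENDPOINT = "https://graph.facebook.com/fql"

# The FQL statement written as plain text instead of hand-typed %20 / %22 escapes.
FQL_TEMPLATE = (
    "SELECT normalized_url,share_count,like_count,comment_count,total_count "
    'FROM link_stat WHERE url="{url}"'
)


def build_graph_query(url):
    """Hypothetical helper: return an ASCII-only query URL for one page URL."""
    fql = FQL_TEMPLATE.format(url=url)
    # urlencode percent-encodes the value as UTF-8 and turns spaces into "+",
    # so the dash (U+2013, per the traceback) becomes %E2%80%93.
    return GRAPH_FQL_ENDPOINT + "?" + urlencode({"q": fql})


query = build_graph_query(
    "http://www.supersavvyme.co.uk/article/how-to-be-happy\u2013laugh-more"
)
print(query)
```

Because the result contains only ASCII characters, it no longer trips the 'ascii' codec when it is handed to the URL opener.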

--- 8 < ----

The traceback I get from the actual scraper on ScraperWiki is as follows:

http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more 
Line 80 - query_graph_api(urls) 
Line 53 - query_data = scrape(graph_query) -- query_graph_api((urls=['http://www.supersavvyme.co.uk', 'http://...more 
Line 21 - data = urllib2.urlopen(unicode(url)) -- scrape((url=u'https://graph.facebook.com/fql?q=SELECT%20url,...more 
/usr/lib/python2.7/urllib2.py:126 -- urlopen((url=u'https://graph.facebook.com/fql?q=SELECT%20url,no...more 
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 177: ordinal not in range(128)

2013-04-28 mediaczar

Could you include the exact problem in your question? – 2013-04-28 18:57:14

Can you post the traceback? – MikeHunter 2013-04-28 19:04:26

If you are using Python 3.x, all you need to do is add one line and change another:

gq = graph_query.encode('utf-8') 
query_data = scrape(gq)

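If you are using Python 2.x, first put the following line at the top of your module file:

# -*- coding: utf-8 -*-

(This is a source encoding declaration: it tells Python how to decode the non-ASCII characters in the file.) Then make all your string literals unicode and encode just before passing them to urlopen: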

# -*- coding: utf-8 -*-
import urllib2

def scrape(url):
    # simplified
    data = urllib2.urlopen(url)
    return data.read()

def query_graph_api(url_list):
    # query Facebook's Graph API, store data.
    for url in url_list:
        # everything is unicode up to this point
        graph_query = graph_query_root + u"%22" + url + u"%22"
        # encode to UTF-8 bytes only at the point of the call
        gq = graph_query.encode('utf-8')
        query_data = scrape(gq)
        print query_data  # debug console

### START HERE ####

graph_query_root = u"https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="

url_list = [u'http://www.supersavvyme.co.uk', u'http://www.supersavvyme.co.uk/article/how-to-be-happy–laugh-more']

query_graph_api(url_list)

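From your code it looks like you are using 3.x, which really is better for dealing with this kind of thing. But you still have to encode when necessary. In 2.x, the best advice is to do what 3.x does by default: use unicode throughout your code, and encode only at the point where bytes are required.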

2013-04-28 20:14:47 MikeHunter
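
To illustrate the "text everywhere, encode only at the boundary" advice, here is a minimal sketch, assuming Python 3 (where urllib2 no longer exists and the helpers live in urllib.parse and urllib.request). Note that it uses a different technique from the answer's code: rather than passing UTF-8 bytes, it percent-encodes the non-ASCII characters so the final URL is pure ASCII. The helper name to_ascii_url and the choice of safe characters are assumptions made for this example.

```python
from urllib.parse import quote

# Query prefix copied from the answer above (already percent-encoded ASCII).
graph_query_root = (
    "https://graph.facebook.com/fql?q=SELECT%20normalized_url,share_count,"
    "like_count,comment_count,total_count%20FROM%20link_stat%20WHERE%20url="
)


def to_ascii_url(url):
    """Hypothetical helper: percent-encode non-ASCII characters as UTF-8.

    Characters listed in `safe` are left untouched, so the URL structure
    and the existing %XX escapes survive unchanged.
    """
    return quote(url, safe=":/?&=%,")


# The dash is written as \u2013, the code point reported in the traceback.
page_url = "http://www.supersavvyme.co.uk/article/how-to-be-happy\u2013laugh-more"

graph_query = graph_query_root + "%22" + page_url + "%22"  # still text (str)
safe_query = to_ascii_url(graph_query)  # converted only at the boundary

print(safe_query)
```

The string concatenation mirrors the answer's approach; only the final step differs. A URL that itself contains "?" or "&" would need stricter quoting than this sketch applies.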
