Detect and change website encoding in Python

I have a problem with website encoding. I wrote a program to scrape a website, but I have not managed to change the encoding of the content I read. My code is:

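import sys, os, glob, re, datetime, optparse
import urllib2

from BSXPath import BSXPathEvaluator, XPathResult
#import BeautifulSoup

#from utility import *

sTargetEncoding = "utf-8"

page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()

# Read the charset parameter from the Content-Type header. getparam()
# returns None when the header names no charset; falling back to the
# target encoding in that case is an assumption, not a safe default.
encoding = req.headers.getparam('charset') or sTargetEncoding
print encoding

# Decode the raw bytes with the declared encoding, then re-encode as UTF-8.
ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)

# originalEncoding may be None, so convert it before concatenating.
print "ORIGINAL ENCODING: " + str(document.originalEncoding)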

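I am using an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the website's own encoding rather than the UTF-8 encoding I tried to convert to. Does anyone have a suggestion?

Thanks

Source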

2011-03-31 kl4us

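Well, there is no guarantee that the encoding given in the HTTP headers matches the one declared inside the HTML itself. A mismatch can come from a misconfigured server or from a wrong charset definition in the HTML. There is really no fully reliable automatic way to detect the encoding. I suggest checking manually which encoding the HTML actually uses (iso-8859-1 versus utf-8, for example, is easy to tell apart) and then hard-coding that encoding in your application.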
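A minimal sketch of that manual check, written for Python 3 with only the standard library (the question's code is Python 2). The helper names charset_from_header, charset_from_meta and is_valid_utf8, the sample header and the sample HTML are all made up for illustration; on a real page you would pass in the response's Content-Type header and the raw bytes you read.

```python
import re


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)',
                      content_type or "", re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(raw_html):
    """Return the charset declared in a <meta> tag near the top, or None."""
    head = raw_html[:4096]  # declarations are expected early in the page
    match = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)',
                      head, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


def is_valid_utf8(raw_html):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        raw_html.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample: the server claims ISO-8859-1, the page says UTF-8.
header = "text/html; charset=ISO-8859-1"
raw = ('<html><head><meta charset="utf-8"></head>'
       '<body>caf\u00e9</body></html>').encode("utf-8")

print("HTTP header says:", charset_from_header(header))
print("HTML meta says:  ", charset_from_meta(raw))
print("Valid UTF-8:     ", is_valid_utf8(raw))

# Once the check settles it, hard-code the encoding for this site.
PAGE_ENCODING = "utf-8"
text = raw.decode(PAGE_ENCODING)
```

The strict UTF-8 decode is the useful test here: text saved as iso-8859-1 that contains accented characters usually fails it, while any byte sequence decodes as iso-8859-1 without error, so success with iso-8859-1 proves nothing.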

Source

2011-03-31 09:00:22
