Encoding problem with urllib2

This is my sample script:

import urllib2, re 

response = urllib2.urlopen('http://domain.tld/file') 
data  = response.read() # Normally displays "the emoticon <3 is blah blah" 

pattern = re.search('(the emoticon)(.*)(is blah blah)', data) 
result = pattern.group(2) # result should contain "<3" now 

print 'The result is ' + result # prints "&lt;3" because not encoded

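As you can see, I fetch a web page and try to pull a string out of it, but the string does not come out decoded correctly, and I am not sure what to add to this script to make the final result correct. Can anyone point out what I am doing wrong?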

2012-05-12 Markum

You might want to look at [this question](http://stackoverflow.com/questions/1208916/decoding-html-entities-with-python).

@Lattyware I looked at it and did not find much help there, since I would rather not use external modules. – Markum

Try this:

>>> import HTMLParser 
>>> h = HTMLParser.HTMLParser() 
>>> h.unescape('wer&amp;wer') 
u'wer&wer'

2012-05-12 05:29:30 lenik
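
For reference, here is a minimal sketch of how the `unescape` call could be applied to the script from the question. It is not part of the original answer. The `data` string is a hypothetical stand-in for `response.read()`, assuming the page really contains the escaped text `&lt;3`. The `try`/`except` import and the `.strip()` call are additions for illustration: the first lets the same snippet run on Python 3, where the function lives in the standard `html` module, and the second drops the spaces that `(.*)` captures around the entity.

```python
import re

try:
    # Python 3.4+: unescape is a plain function in the standard library
    from html import unescape
except ImportError:
    # Python 2, as in the question: the same HTMLParser call used in the answer
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# Hypothetical stand-in for response.read(); assumes the page holds the escaped text
data = 'the emoticon &lt;3 is blah blah'

match = re.search('(the emoticon)(.*)(is blah blah)', data)

# group(2) is ' &lt;3 ' here; strip the spaces, then turn HTML entities into characters
result = unescape(match.group(2).strip())

print('The result is ' + result)
```

Both imports come from the standard library, so this needs no external module. The underlying issue is HTML entity escaping in the page source rather than a character encoding problem in `urllib2` itself.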
