Python regular expression problem with Chinese text

I am learning Python and trying to use a regular expression to extract some data from HTML, but I have run into trouble. Here is my code:

# -*- coding:utf-8 -*- 

import urllib2 
import re 

url = u'http://www.6vhao.net/dy1/' 
msg = u'ssss<a href="http://www.6vhao.net/dy1/index_2.html">下一頁</a>&nbsp;<a' 
pattern = re.compile(ur'\<a href="(?P<url>.*)"\>下一頁</a\>') 

response = urllib2.urlopen(url) 
html = response.read() 
#print html 
for m in pattern.finditer(msg): 
    s = m.group('url') 
    print 'msg: '+s 

for m in pattern.finditer(html): 
    s = m.group('url') 
    print 'html: '+s

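The 'msg' string in the code is the data I want to get from the HTML. But the only output is "msg: http://www.6vhao.net/dy1/index_2.html". I would like to know why the regular expression does not work on html, and how to make it work. Thanks!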

Source

2016-03-27 Lension

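response.read() returns a byte string, but the pattern is a unicode pattern, so the Chinese characters in it never match the raw bytes. Decode the result of .read() into a unicode object first (this assumes the page is UTF-8; use the page's actual charset otherwise):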

html = response.read().decode("utf-8")
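To see the decode step in isolation, here is a minimal sketch that needs no network access. The names raw and find_next_urls are hypothetical, raw merely stands in for what response.read() would return, and the 'utf-8' default is an assumption about the page. The pattern also uses [^"]* where the question used .*, which is a change of mine so that the capture stops at the closing quote of the href.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import re

# Hypothetical stand-in for response.read(): raw UTF-8 bytes, not text.
raw = 'ssss<a href="http://www.6vhao.net/dy1/index_2.html">下一頁</a>&nbsp;<a'.encode('utf-8')

# Text (unicode) pattern. [^"]* replaces the question's .* so the capture
# cannot run past the end of the href attribute.
pattern = re.compile('<a href="(?P<url>[^"]*)">下一頁</a>')


def find_next_urls(data, encoding='utf-8'):
    """Decode bytes to text, then collect every URL the pattern captures."""
    text = data.decode(encoding)  # assumption: the page really is UTF-8
    return [m.group('url') for m in pattern.finditer(text)]


urls = find_next_urls(raw)
print(urls)
```

If the site serves a different encoding, such as GBK, pass that name as encoding instead. Decoding with the wrong charset either raises UnicodeDecodeError or yields text the pattern will not match.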

Source

2016-03-27 12:20:39
