Using mechanize to fetch Chinese characters from a website returns nothing

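I am trying to scrape Chinese characters as well as non-standard letters. In the result, it is as if mechanize skipped the Chinese characters and the non-standard letters.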

My code:

import mechanize 
import re 

br = mechanize.Browser() 
br.addheaders = [('User-agent', 'Mozilla/5.0')] 
br.set_handle_robots(False) 

html = br.open('http://hanzidb.org/character-list/by-frequency') 

html = html.read().lower() 
html = unicode(html, errors='ignore') 

# Only get the data between <td>...</td>
pattern2 = re.compile(r'<td>(.*?)</td>', re.MULTILINE) 
match_description2 = re.findall(pattern2, html) 

data = [] 

#Collect the content of the table 
for desc in match_description2: 
    data.append(desc) 
    print desc

As a result I should get (for example):

<tr><td><a href="/character/是">是</a></td><td><span style="color:#000099;">shì</span></td><td><span class="smmr">indeed, yes, right; to be; demonstrative pronoun, this, that</span></td><td><a href="/character/日" title="Kangxi radical 72">日</a>&nbsp;72.5</td><td>9</td><td>1</td><td>1479</td></td><td>3</td></tr>

Instead, the result I get is:

<td><a href="/character/"></a></td><td><span style="color:#000099;">sh</span></td><td><span class="smmr">indeed, yes, right; to be; demonstrative pronoun, this, that</span></td><td><a href="/character/" title="kangxi radical 72"></a>&nbsp;72.5</td><td>9</td><td>1</td><td>1479</td></td><td>3</td>

I appreciate any help, and I can provide more information if needed.

Source

2016-03-25 CJ Jacobs

Please use 'beautifulsoup4' to parse HTML. Using regular expressions on HTML can lead to [undesirable results](http://stackoverflow.com/a/1732454/918959)
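To illustrate the point of this comment (parse the markup instead of matching it with a regular expression), here is a minimal sketch. It deliberately swaps in the standard library's `html.parser` for beautifulsoup4 so that it runs without extra packages, and it is written for Python 3 rather than the Python 2 of the question. The `CellCollector` class and the shortened `sample` row are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collect the text inside every <td> cell."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []       # finished cell texts, in document order
        self._in_cell = False
        self._parts = []      # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self.cells.append("".join(self._parts).strip())
            self._in_cell = False

    def handle_data(self, data):
        # Text nested inside <a> or <span> within the cell is kept too.
        if self._in_cell:
            self._parts.append(data)


# Shortened stand-in for one row of the page (an assumption, not the real response).
sample = (
    '<tr><td><a href="/character/是">是</a></td>'
    '<td><span style="color:#000099;">shì</span></td>'
    '<td>9</td></tr>'
)

parser = CellCollector()
parser.feed(sample)
parser.close()
cells = parser.cells
print(cells)
```

The parser works on text that has already been decoded correctly, so it keeps `是` and `shì` intact; it does not by itself repair characters that were dropped earlier during decoding.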

You have to remove this line: html = unicode(html, errors='ignore')

Your terminal's LANG environment variable must be set to a UTF-8 locale.

Then run the code!
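A short sketch of why removing that line helps. In Python 2, `unicode(html, errors='ignore')` decodes with the default codec, which is normally ASCII, and `errors='ignore'` silently discards every byte that is not ASCII. The snippet below reproduces the effect in Python 3 using `bytes.decode`; the `raw` bytes are a made-up stand-in for the page content, assuming the site serves UTF-8.

```python
# Stand-in for the bytes returned by html.read() (assumed to be UTF-8).
raw = '<td><a href="/character/是">是</a></td><td>shì</td>'.encode('utf-8')

# What the question's code effectively did: ASCII decoding that ignores errors.
# Every multi-byte UTF-8 character is dropped without any warning.
lossy = raw.decode('ascii', errors='ignore')

# Decoding with the page's actual encoding keeps all characters.
correct = raw.decode('utf-8')

print(lossy)
print(correct)
```

The lossy string has the same shape as the output reported in the question: empty link text and `sh` in place of `shì`.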

Source

2016-03-25 04:46:55 han058

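Sorry, I did not make it clear that this is not all of my code, only the relevant bits. Also, that change worked, thanks a ton!

No problem, you're welcome. – han058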
