2012-04-22 28 views
0

get_text() raises UnicodeEncodeError

I have the following HTML:

<div class="dialog"> 
<div class="title title-with-sort-row"> 
    <h2>Description</h2> 
    <div class="dialog-search-sort-bar"> 
    </div> 
</div> 
<div class="content"><div style="margin-right: 20px; margin-left: 30px;"> 
    <span class="description2"> 
     With “Antonia Polygon – Standard”, you have a figure that is unique in the Poser community. 
     She is made available under a Creative Commons License that gives endless opportunities for further development. 
     This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
     The result is a figure that has very good bending and morphing behavior. 
     <br /> 
    </span> 
</div> 
</div> 

I need to find this div among a number of divs with class="dialog", and then pull out the text inside the span with class="description2".

When I use this code:

description = soup.find(text=re.compile('Description'))
if description is not None:
    someEl = description.parent
    parent1 = someEl.parent
    parent2 = parent1.parent
    description = parent2.find('span', {'class': 'description2'})
    print 'Description: ' + str(description)

I get:

<span class="description2"> 
    With Â“Antonia Polygon – StandardÂ”, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior. 
    <br/> 
</span> 

If I try to get just the text, without the HTML and the non-ASCII characters, using

description = description.get_text() 

I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\x93'

How can I convert this block of HTML to straight ASCII?

+0

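The character '“' is not an ASCII character. Is your goal to find the most similar ASCII character (which is hard), or simply to remove all non-ASCII characters? Or do you actually want to output correct Unicode, e.g. UTF-8, rather than ASCII? – jogojapan 2012-04-23 02:04:45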

+0

Just remove all the non-ASCII characters – Stephen 2012-04-24 21:44:22

+0

Obligatory: http://bit.ly/unipain – Daenyth 2012-05-07 12:55:56

Answers

2
#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

foo = u'With Â“Antonia Polygon – StandardÂ”, you have a figure that is unique in the Poser community.She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.' 

print foo.encode('ascii', 'ignore') 

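There are three things to note.

The first is the 'ignore' argument to the encode method. It tells the method to drop any character that cannot be represented in the chosen encoding (in this case ascii, to be on the safe side).

The second is that we explicitly make foo a unicode string by prefixing the literal with u.

The third is the explicit file encoding directive: # -*- coding: utf-8 -*-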
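To illustrate the first point, here is a minimal sketch. It assumes Python 3 (the answer's own code is Python 2), where every str is already Unicode, so neither the u prefix nor the coding line is needed. The sample string and the helper name to_ascii are made up for illustration.

```python
# Sketch for Python 3 (assumption: the answer above targets Python 2).
# `sample` and `to_ascii` are illustrative names, not part of the original answer.
sample = "With \u201cAntonia Polygon \u2013 Standard\u201d, you have a figure"


def to_ascii(text):
    """Drop every character that ASCII cannot represent."""
    return text.encode("ascii", "ignore").decode("ascii")


# The default error handler ('strict') raises, much like the error in the question.
try:
    sample.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict failed:", exc.reason)

# With 'ignore' the curly quotes and the en dash simply disappear.
print(to_ascii(sample))
```

Note that dropping the en dash leaves two consecutive spaces behind; collapse the whitespace afterwards if that matters.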

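Also, if you're reading this answer without reading Daenyth's comment, you're being silly. If the output is going to be used in HTML/XML, you can use xmlcharrefreplace instead of ignore above, for great justice.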
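As a rough comparison of the two error handlers, here is a small sketch. It assumes Python 3, and the sample string is illustrative rather than taken from the page.

```python
# Sketch for Python 3; `sample` is an illustrative string, not scraped data.
sample = "\u201cAntonia Polygon \u2013 Standard\u201d"

# 'ignore' silently drops characters that ASCII cannot represent.
dropped = sample.encode("ascii", "ignore").decode("ascii")

# 'xmlcharrefreplace' swaps them for numeric character references,
# which an HTML or XML consumer can turn back into the original characters.
escaped = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(dropped)  # Antonia Polygon  Standard
print(escaped)  # &#8220;Antonia Polygon &#8211; Standard&#8221;
```

Both results are pure ASCII, but only the second one preserves the information carried by the quotes and the dash.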

+1

Using 'xmlcharrefreplace' as the second argument would be much better in this case, since he is dealing with HTML. – Daenyth 2012-05-07 12:54:53

+0

Yes, I agree. I was just being lazy, since the OP said in the comments that he just wants to remove all the misbehaving characters. :) – JosefAssad 2012-05-07 12:59:54

+1

Still, it's worth mentioning, since other people with a similar problem may come across this. – Daenyth 2012-05-07 13:02:23