get_text() raises UnicodeEncodeError

I have the following HTML:

<div class="dialog"> 
<div class="title title-with-sort-row"> 
    <h2>Description</h2> 
    <div class="dialog-search-sort-bar"> 
    </div> 
</div> 
<div class="content"><div style="margin-right: 20px; margin-left: 30px;"> 
    <span class="description2"> 
     With 「Antonia Polygon – Standard」, you have a figure that is unique in the Poser community. 
     She is made available under a Creative Commons License that gives endless opportunities for further development. 
     This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
     The result is a figure that has very good bending and morphing behavior. 
     <br /> 
    </span> 
</div> 
</div>

I need to find this div among a number of divs with class="dialog", then pull out the text inside the span with class="description2".

When I use this code:

description = soup.find(text = re.compile('Description')) 
if description != None: 
    someEl = description.parent 
    parent1 = someEl.parent 
    parent2 = parent1.parent 
    description = parent2.find('span', {'class' : 'description2'}) 
    print 'Description: ' + str(description)

I get:

<span class="description2"> 
    With Â「Antonia Polygon Â– StandardÂ」, you have a figure that is unique in the Poser community. 
    She is made available under a Creative Commons License that gives endless opportunities for further development. 
    This figure was developed by a group of talented members of the Poser community in a thirty-month effort. 
    The result is a figure that has very good bending and morphing behavior. 
    <br/> 
</span>

If I try to get just the text, without the HTML and the non-ASCII characters, using

description = description.get_text()

I receive a UnicodeEncodeError: 'ascii' codec can't encode character u'\x93'

How can I convert this block of HTML to plain ASCII?

2012-04-22 Stephen

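The character in question is not an ASCII character. Is your goal to find the most similar ASCII character (which is hard), or simply to remove all non-ASCII characters? Or do you really want to output correct Unicode, for example UTF-8, rather than ASCII? – jogojapan 2012-04-23 02:04:45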

Just remove all the non-ASCII characters – Stephen 2012-04-24 21:44:22

Obligatory: http://bit.ly/unipain – Daenyth 2012-05-07 12:55:56

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

foo = u'With Â「Antonia Polygon Â– StandardÂ」, you have a figure that is unique in the Poser community.She is made available under a Creative Commons License that gives endless opportunities for further development. This figure was developed by a group of talented members of the Poser community in a thirty-month effort. The result is a figure that has very good bending and morphing behavior.' 

print foo.encode('ascii', 'ignore')

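There are three things to note.

The first is the 'ignore' argument to the encode method. It tells the method to drop any character that is not in the range of the chosen encoding (in this case ASCII, to be safe).

The second is that we explicitly make foo a unicode string by prefixing the string literal with u.

The third is the explicit file encoding directive: # -*- coding: utf-8 -*-.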
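
As a minimal sketch of the first point, the snippet below compares the default strict behaviour with 'ignore'. The sample string and the helper name to_ascii are made up for illustration and are not from the question; the non-ASCII characters are written as \u escapes so the source file itself stays ASCII. It is written to run on both Python 2 and Python 3.

```python
from __future__ import print_function

# Hypothetical sample text (not the asker's data): \u201c and \u201d are
# curly double quotes, \u2013 is an en dash.
text = u'With \u201cAntonia Polygon \u2013 Standard\u201d, you have a figure.'


def to_ascii(value):
    """Return value with every non-ASCII character dropped."""
    # encode() yields a byte string; decode() turns it back into text.
    return value.encode('ascii', 'ignore').decode('ascii')


# With the default error handler ('strict') the same encode call fails.
try:
    text.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

cleaned = to_ascii(text)
print(strict_failed)
print(cleaned)
```

Note that dropping the en dash leaves two consecutive spaces behind, so some whitespace clean-up may still be wanted afterwards.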

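Also, if you read this answer without reading Daenyth's comments on it, you are being silly. If the output is going to be used in HTML/XML, you can use 'xmlcharrefreplace' instead of 'ignore' above, so that non-ASCII characters are turned into numeric character references rather than dropped.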
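
For comparison, here is a small sketch of the 'xmlcharrefreplace' error handler under the same assumptions (a made-up sample string, runnable on Python 2 and 3). Instead of discarding a character that ASCII cannot represent, it writes a numeric character reference that a browser or XML parser can turn back into the original character.

```python
from __future__ import print_function

# Hypothetical sample text: curly quotes (\u201c, \u201d) and an en dash (\u2013).
text = u'\u201cAntonia Polygon \u2013 Standard\u201d'

# Each non-ASCII character becomes &#NNNN;, where NNNN is its decimal code point.
escaped = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(escaped)
```

The result is pure ASCII but still carries the information needed to display the original punctuation, which is why it suits output that ends up back in an HTML page.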

2012-05-07 12:31:04 JosefAssad

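Using 'xmlcharrefreplace' as the second argument would be much better in this case, since he is dealing with HTML. – Daenyth 2012-05-07 12:54:53

Yes, I agree. I was just being lazy, since the OP said in a comment that he only wants to remove all the misbehaving characters. :) – JosefAssad 2012-05-07 12:59:54

Still, it is worth mentioning, since other people with a similar problem may come across this. – Daenyth 2012-05-07 13:02:23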
