Python: converting non-standard characters

I have a list that I extracted from a web page, and it contains some non-standard characters.

Example of the list:

[<td class="td-number-nowidth"> 10Â 115 </td>, <td class="td-number-nowidth"> 4Â 635 (46%) </td>, <td class="td-number-nowidth"> 5Â 276 (52%) </td>, ...]

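The A with a hat (Â) is supposed to be a comma. Can anyone suggest how to convert or replace these so that I get the value 10115, as for the first value in the list?

Source code: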

from urllib import urlopen 
from bs4 import BeautifulSoup 
import re, string 
content = urlopen('http://www.worldoftanks.com/community/accounts/1000395103-FrankenTank').read() 
soup = BeautifulSoup(content) 

BattleStats = soup.find_all('td', 'td-number-nowidth') 
print BattleStats

Thanks, Frank

2012-10-13 User Error

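Your previous question indicates that you use 'BeautifulSoup()', which should handle character encoding automatically. How do you get 'Â'? (Provide some code.) – jfs

You are right, J.F. Here is the code I am playing with (posted above). –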

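Does the site say anything about its encoding in the response headers? (The charset is declared in the Content-Type header; Content-Encoding describes compression such as gzip, not the character set.) You have to get it and decode those strings in the list with the .decode method. It would look like encoded_string.decode("encoding"), where encoding can be any codec name, utf-8 being one of them.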
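
As a rough illustration of the idea above, here is a minimal Python 3 sketch (the thread itself uses Python 2). It assumes you already have the raw bytes and the value of the Content-Type header; the helper name charset_from_content_type and the sample values are made up for the example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper for illustration; falls back to `default`
    when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# Sample data (assumed): the UTF-8 bytes of " 10<NO-BREAK SPACE>115 "
raw = b" 10\xc2\xa0115 "
charset = charset_from_content_type("text/html; charset=UTF-8")
text = raw.decode(charset)

print(charset)     # the charset name, lower-cased
print(repr(text))  # the decoded text, with U+00A0 between the digit groups
```

If the page really is UTF-8, the decoded text contains a single NO-BREAK SPACE (U+00A0) rather than 'Â'.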

2012-10-13 23:53:37

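The site's declared encoding is UTF-8. –

Then take each string and use 'encoded_string.decode('utf-8')'. –

Unfortunately for me, this answer doesn't work, because the site /says/ UTF-8 but contains garbage characters all the same. – davenpcj

What have you tried?

This might work: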

a = ['10Â 115', '4Â 635 (46%)', '5Â 276 (52%)'] 
for b in a:
    # str.replace returns a new string, so print the result
    print b.replace("\xc3\x82 ", '')

Output:

10115 
4635 (46%) 
5276 (52%)

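Depending on how constant it is (if it is always just that one character), there may be a better way to go about it (replacing everything from the \ up to the space with an empty string).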
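
One way to generalise this, sketched in Python 3 under the assumption that the junk always sits between two digits: drop any run of non-ASCII characters plus whitespace that separates digit groups. The function name strip_digit_separators is hypothetical.

```python
import re

# Between two digits: optional non-ASCII, non-whitespace junk (such as 'Â'),
# followed by at least one whitespace character. In Python 3, \s also
# matches NO-BREAK SPACE (U+00A0).
_SEPARATOR = re.compile(r"(?<=\d)[^\x00-\x7f\s]*\s+(?=\d)")


def strip_digit_separators(text):
    """Remove separator junk that sits between two digits."""
    return _SEPARATOR.sub("", text)


for item in ["10Â 115", "4Â 635 (46%)", "5Â 276 (52%)"]:
    print(strip_digit_separators(item))
```

The space before "(46%)" is left alone because it is not followed by a digit.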

2012-10-13 23:56:38 MercuryRising

You can use the .decode method with the errors='ignore' argument.

>>> s = '[ 10Â 115 , 4Â 635 (46%) , 5Â 276 (52%) , ...]' 
>>> s.decode('ascii', errors='ignore') 
u'[ 10 115 , 4 635 (46%) , 5 276 (52%) , ...]'

Here is help(''.decode):

decode(...) 
    S.decode([encoding[,errors]]) -> object 

    Decodes S using the codec registered for encoding. encoding defaults 
    to the default encoding. errors may be given to set a different error 
    handling scheme. Default is 'strict' meaning that encoding errors raise 
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace' 
    as well as any other name registered with codecs.register_error that is 
    able to handle UnicodeDecodeErrors.
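
For readers on Python 3, where only bytes objects have .decode, a rough equivalent might look like the sketch below. The sample value is assumed to be the UTF-8 bytes of the first item in the question, and the last step simply removes the remaining whitespace to reach the integer the question asks for.

```python
# Assumed sample: the first value from the question, as UTF-8 bytes.
raw = "10Â 115".encode("utf-8")

# errors="ignore" silently drops every byte that is not valid ASCII.
ascii_only = raw.decode("ascii", errors="ignore")

# Remove the remaining whitespace and convert to an integer.
value = int("".join(ascii_only.split()))

print(repr(ascii_only))
print(value)
```

Note that errors='ignore' discards data without warning, so it is best kept for cases where the dropped bytes are known to be junk.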

2012-10-13 23:58:34 dnozay

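Thanks for the help with decoding. –

This worked for me; the site returns garbage characters outside the declared encoding. – davenpcj

BeautifulSoup handles character encodings automatically. The problem is with printing to your console, which apparently does not support some Unicode characters. In this case it is 'NO-BREAK SPACE' (U+00A0):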

>>> L = soup.find_all('td', 'td-number-nowidth') 
>>> L[0] 
<td class="td-number-nowidth"> 10 123 </td> 
>>> L[0].get_text() 
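u' 10\xa0123 '

Note that the text is Unicode. Check whether print u'<\u00a0>' works in your case.

You can control the output encoding by changing the PYTHONIOENCODING environment variable before running the script. That way you can redirect output to a file with the utf-8 encoding specified, and use the ascii:backslashreplace value for debugging runs in a console, all without changing your script. Example in bash: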

$ python -c 'print u"<\u00a0>"' # use default encoding 
< > 
$ PYTHONIOENCODING=ascii:backslashreplace python -c 'print u"<\u00a0>"' 
<\xa0> 
$ PYTHONIOENCODING=utf-8 python -c 'print u"<\u00a0>"' > output.txt
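
To see what those two settings do to the text itself, the following Python 3 sketch applies the same codecs and error handlers directly with str.encode. It is only an illustration of the encoding step, not of how the interpreter wires up its output streams.

```python
text = "<\u00a0>"

# ascii with backslashreplace: unencodable characters become escape sequences.
escaped = text.encode("ascii", errors="backslashreplace")

# utf-8 can represent U+00A0 directly, as the two bytes C2 A0.
utf8 = text.encode("utf-8")

# Plain ascii with the default 'strict' handler fails instead.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(escaped)
print(utf8)
print(strict_failed)
```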

To print the numbers, you could split on the non-breaking space and post-process the items:

>>> [td.get_text().split(u'\u00a0') 
... for td in soup.find_all('td', 'td-number-nowidth')] 
[[u' 10', u'115 '], [u' 4', u'635 (46%) '], [u' 5', u'276 (52%) ']]

Or you could replace it with a comma:

>>> [td.get_text().replace(u'\u00a0', ', ').encode('ascii').strip() 
... for td in soup.find_all('td', 'td-number-nowidth')] 
['10, 115', '4, 635 (46%)', '5, 276 (52%)']
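
If the goal is the numeric value itself (10115 in the question), a possible Python 3 sketch is shown below. It assumes the cell text has the shape returned by get_text() above, that is, a count with U+00A0 between digit groups and an optional percentage in parentheses. The name parse_stat is hypothetical.

```python
import re

NBSP = "\u00a0"

# A count, optionally followed by a percentage such as "(46%)".
_STAT = re.compile(r"(\d+)(?:\s*\((\d+)%\))?$")


def parse_stat(text):
    """Return (count, percent) from cell text; percent is None if absent."""
    cleaned = text.replace(NBSP, "").strip()
    match = _STAT.match(cleaned)
    if match is None:
        raise ValueError("unrecognised stat: %r" % text)
    count = int(match.group(1))
    percent = int(match.group(2)) if match.group(2) else None
    return count, percent


for cell in [" 10\u00a0115 ", " 4\u00a0635 (46%) ", " 5\u00a0276 (52%) "]:
    print(parse_stat(cell))
```

Raising ValueError on unexpected input is a design choice for the sketch, so that a layout change on the page is noticed rather than silently producing wrong numbers.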

2012-10-14 03:55:54 jfs
