£ displaying in urllib2 and Beautiful Soup

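I am trying to write a small web scraper in Python, and I think I have run into an encoding problem. I want to scrape http://www.resident-music.com/tickets (specifically the table on that page). A row might look like this: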

<tr> 
     <td style="width:64.9%;height:11px;"> 
     <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p> 
     </td> 
     <td style="width:13.1%;height:11px;"> 
     <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p> 
     </td> 
     <td style="width:15.42%;height:11px;"> 
     <p><strong>various</strong></p> 
     </td> 
     <td style="width:6.58%;height:11px;"> 
     <p><strong>&pound;55.00</strong></p> 
     </td> 
     </tr>

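I am essentially trying to replace &pound;55.00 with £55.00, and to clean up any other 'non-text' nasties in the same way.

I have tried several different encoding approaches that you can use with BeautifulSoup and urllib2, to no avail. I think I am just doing it all wrong.

Thanks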


2016-09-30 Ollie

You want to unescape the HTML, which you can do with html.unescape in Python 3:

In [14]: from html import unescape 

In [15]: h = """<tr> 
    ....:   <td style="width:64.9%;height:11px;"> 
    ....:   <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p> 
    ....:   </td> 
    ....:   <td style="width:13.1%;height:11px;"> 
    ....:   <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p> 
    ....:   </td> 
    ....:   <td style="width:15.42%;height:11px;"> 
    ....:   <p><strong>various</strong></p> 
    ....:   </td> 
    ....:   <td style="width:6.58%;height:11px;"> 
    ....:   <p><strong>&pound;55.00</strong></p> 
    ....:   </td> 
    ....:  </tr>""" 

In [16]: 

In [16]: print(unescape(h)) 
<tr> 
     <td style="width:64.9%;height:11px;"> 
     <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p> 
     </td> 
     <td style="width:13.1%;height:11px;"> 
     <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p> 
     </td> 
     <td style="width:15.42%;height:11px;"> 
     <p><strong>various</strong></p> 
     </td> 
     <td style="width:6.58%;height:11px;"> 
     <p><strong>£55.00</strong></p> 
     </td> 
     </tr>

For Python 2, use the HTMLParser module (html.parser is its Python 3 name):

In [6]: from HTMLParser import HTMLParser 

In [7]: unescape = HTMLParser().unescape 

In [8]: print(unescape(h)) 
<tr> 
     <td style="width:64.9%;height:11px;"> 
     <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p> 
     </td> 
     <td style="width:13.1%;height:11px;"> 
     <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p> 
     </td> 
     <td style="width:15.42%;height:11px;"> 
     <p><strong>various</strong></p> 
     </td> 
     <td style="width:6.58%;height:11px;"> 
     <p><strong>£55.00</strong></p> 
     </td>
     </tr>

You can see that both approaches correctly unescape all of the entities, not just the pound sign.
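As a small illustration of the same idea applied to a single table cell, here is a minimal Python 3 sketch. The helper name clean_cell is hypothetical (it is not part of any library), and the sketch assumes the cell's text has already been pulled out of the markup. It also collapses runs of whitespace, because &nbsp; unescapes to a non-breaking space rather than a plain one.

```python
from html import unescape


def clean_cell(fragment):
    """Unescape HTML entities, then collapse whitespace (hypothetical helper)."""
    text = unescape(fragment)
    # str.split() with no arguments splits on any Unicode whitespace,
    # including the non-breaking space that &nbsp; turns into.
    return " ".join(text.split())
```

With this helper, clean_cell("&pound;55.00") should give £55.00, and the doubled gap left behind by &nbsp; in the first cell is reduced to a single space.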


2016-10-01 00:01:49

I used requests for this, but hopefully you can do the same with urllib2. Here is the code:

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 

import requests 
from BeautifulSoup import BeautifulSoup 

soup = BeautifulSoup(requests.get('your_url').text) 
chart = soup.findAll(name='tr') 
print str(chart).replace('&pound;',unichr(163)) #replace '&pound;' with '£'

Now you should get the expected output!

Sample output:

... 
<strong>£71.50</strong></p> 
...

Anyway, you can do the parsing in many ways. The interesting part here is: print str(chart).replace('&pound;',unichr(163)), which was quite challenging :)
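For readers who would rather stay in the standard library than use requests, the fetching step might look roughly like the sketch below. It is written for Python 3, where urllib2's functionality lives in urllib.request. The names fetch_html and decode_body are hypothetical, and the fallback to UTF-8 when the server declares no charset is an assumption, not something the original page guarantees.

```python
from urllib.request import urlopen


def decode_body(raw, charset=None):
    """Decode response bytes to text.

    Falls back to UTF-8 (an assumption) when no charset was declared, and
    replaces undecodable bytes instead of raising.
    """
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Download a page and return it as text (hypothetical helper)."""
    with urlopen(url) as response:
        # get_content_charset() reads the charset from the Content-Type
        # header and returns None if the header does not specify one.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Decoding the bytes explicitly like this keeps a literal £ in the page from turning into mojibake before any entity handling happens.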

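Update

If you want to unescape several characters, or even just one (dashes, pound signs, and so on), it will be easier and more efficient to use the parser shown in Padraic's answer. It is also worth reading the comments, which cover other encoding problems these tools handle.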
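To illustrate the parser-based route, here is a minimal Python 3 sketch built on the standard library's html.parser. With convert_charrefs=True the parser unescapes every entity before passing text to handle_data, so no per-entity replace call is needed. The class RowTextParser and the function parse_rows are hypothetical names introduced for this example, and the sketch assumes simple, non-nested tables like the one in the question.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row (hypothetical)."""

    def __init__(self):
        # convert_charrefs=True makes the parser unescape entities itself.
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the pieces and collapse whitespace (including \xa0).
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def parse_rows(html_text):
    """Return a list of rows, each a list of cleaned cell strings."""
    parser = RowTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

Feeding the row from the question to parse_rows should yield one row of four strings, with the price cell reading £55.00 and the en dash in the date cell preserved.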


2016-09-30 22:23:22 coder

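This is not how you want to unescape HTML: it means calling replace for every escaped entity on the page, and the initial str call can itself cause encoding errors. I also would not encourage using BeautifulSoup3. –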

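I respect your comment, but I disagree. If you look here: https://wiki.python.org/moin/EscapingHtml you will see that those ready-made libraries do the same thing as my line of code. The difference is that they hand me a ready-made result, which I personally am not in favour of. In some cases they get the job done, but this is a very specific and simple task. As for 'bs3' instead of 'bs4', it does not matter for what the OP wants to do. But I respect your opinion too! – coder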

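*I am essentially trying to replace &pound;55.00 with £55.00, **and any other 'non-text' nasties.*** The *other 'non-text' nasties* are escaped entities, which could be any of a large number. It also matters that bs3 is broken and no longer maintained. –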
