Replacing HTML codes in Python


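I am using regular expressions to parse a website's source code and display news headlines in a Tkinter window. I have been told that parsing HTML with regular expressions is not the best idea, but unfortunately there is no time to change that now.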

I can't seem to replace the HTML codes for special characters, such as the apostrophe (').

Currently, I have the following:

union_url = 'http://www.news.com.au/sport/rugby' 

def union(): 
    union_string = urlopen(union_url).read() 
    union_string.replace("&#8217;", "'") 
    union_headline = re.findall('(?:sport/rugby/.*) >(.*)<', union_string) 
    union_headline_label= Label(union_window, text = union_headline[0], font=('Times',20,'bold'), bg = 'White', width = 85, height = 3, wraplength = 500)

This does not get rid of the HTML character codes. As an example, the headline prints as

Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;

I have tried to find an answer without any luck. Any help is much appreciated.

Source

2015-10-14 BlizzzX

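Are you trying to fetch the data, or to parse data out of the HTML source? – Ja8zyjits

Sorry - fetching data to display in a Tkinter widget – BlizzzX

Have you heard of [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/)? Your life will be better with it... parsing HTML can be hard. – Ja8zyjits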
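In the spirit of that comment, here is a minimal parser-based sketch. It does not use Beautiful Soup itself; it swaps in the standard library's `html.parser` and assumes Python 3.5 or later, whereas the question's code looks like Python 2. The `HeadlineParser` class and the `sample_html` snippet are hypothetical stand-ins for the real page, whose markup may differ.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of <a> tags whose href contains a given fragment."""

    def __init__(self, href_fragment):
        # convert_charrefs=True makes the parser decode &#8217; and similar
        # character references before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.href_fragment = href_fragment
        self.headlines = []
        self._buffer = None  # text pieces while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.href_fragment in href:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headlines.append(text)
            self._buffer = None


# Hypothetical snippet standing in for the downloaded page source
sample_html = (
    '<a href="/sport/rugby/larkham-story">'
    "Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;</a>"
)

parser = HeadlineParser("sport/rugby")
parser.feed(sample_html)
parser.close()
headline = parser.headlines[0]  # text with the references already decoded
```

Because the parser decodes the references itself, `headline` can be passed to the Tkinter `Label` without any further replacement step.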

You can use re.sub() with a callback function to unescape (or remove) any escapes:

>>> import re
>>> def htmlUnescape(m):
...     # &#8217; is a decimal reference, so parse it in base 10, not 16
...     return unichr(int(m.group(1)))
...
>>> re.sub(r'&#(\d+);', htmlUnescape, "This is something &#8217; with an HTML-escaped character in it.")
u'This is something \u2019 with an HTML-escaped character in it.'
>>>
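Two further points may help here, sketched below under the assumption of Python 3.4 or later (the session above is Python 2). First, `str.replace()` returns a new string rather than modifying the original, so the question's call has no effect unless its result is assigned. Second, the standard library's `html.unescape()` decodes all numeric and named references in one call. The variable names are illustrative only.

```python
import html

# Sample text taken from the headline quoted in the question
raw_headline = "Larkham: Real worth of &#8216;Giteau&#8217;s Law&#8217;"

# str.replace() returns a new string; keep the result by assigning it.
# This only handles the one reference that is spelled out.
replaced = raw_headline.replace("&#8217;", "'")

# html.unescape() decodes every numeric and named reference at once,
# so &#8216; and &#8217; both become real curly quotes.
headline = html.unescape(raw_headline)
```

Here `replaced` still contains `&#8216;`, while `headline` contains no references at all, which is why the general-purpose function is usually the safer choice.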

Source

2015-10-14 09:25:58
