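Removing HTML tags using Python?

I know there are probably a million questions about this, but I'd like to know how to remove these tags without importing or using HTMLParser or regular expressions. I've tried a bunch of different replace statements to remove the parts of the string enclosed in <>, but to no avail.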

Basically, what I'm working with is:

from urllib.request import urlopen  # Python 3; in Python 2, urlopen lives in urllib2

response = urlopen(url)
html = response.read()
html = html.decode()

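From here I'm just trying to manipulate the string variable html to do the above. Is there a way to do it the way I specified, or do I have to use the methods I've seen before?

I also tried writing a for loop that goes through every character and checks whether it is enclosed in a tag, but for some reason it doesn't print the right result. Here it is: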

for i in html:
    if i == '<':
        html.replace(i, '')
        delete = True
    if i == '>':
        html.replace(i, '')
        delete = False
    if delete == True:
        html.replace(i, '')

Any input would be appreciated.

2014-02-26 user2909869

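Please don't use regular expressions to parse HTML. It won't work; see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags for an entertaining explanation.

_without importing or using HTMLParser or regex._ Why are you giving yourself such silly restrictions?

A misleading title – Totem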

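str.replace returns a copy of the string with all occurrences of the old substring replaced by the new one. It does not change the string in place, so calling it without using the return value, as you did, has no effect. You also shouldn't try to modify the string your loop is iterating over. Using an extra list is one way to go: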

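txt = []
delete = False  # start outside any tag; must be set before the loop reads it
for i in html:
    if i == '<':
        delete = True   # entering a tag: skip characters from here on
        continue
    if i == '>':
        delete = False  # leaving the tag: keep characters again
        continue
    if delete:
        continue

    txt.append(i)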

Now the txt list contains the characters of the resulting text, which you can join:

print(''.join(txt))

Demo:

html = '<body><div>some</div><div>text</div></body>' 
#... 
>>> txt 
['s', 'o', 'm', 'e', 't', 'e', 'x', 't'] 
>>> ''.join(txt) 
'sometext'
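
For reuse, the same loop can be wrapped in a function. This is only a minimal sketch of the approach above, not a real HTML parser: the name strip_tags is made up for illustration, and it assumes every '<' opens a tag and the next '>' closes it. A '>' inside an attribute value, comments, script contents, and entities such as &amp; are not handled.

```python
def strip_tags(html):
    """Return html with everything between '<' and '>' removed.

    Hypothetical helper: a naive character scan, not an HTML parser.
    """
    txt = []
    inside_tag = False  # True while we are between '<' and '>'
    for ch in html:
        if ch == '<':
            inside_tag = True
        elif ch == '>':
            inside_tag = False
        elif not inside_tag:
            txt.append(ch)  # keep only characters outside tags
    return ''.join(txt)


# str.replace never changes the original string; it returns a new one.
s = '<b>hi</b>'
s.replace('<b>', '')        # return value discarded, s is unchanged
unchanged = s
s = s.replace('<b>', '')    # rebinding the name keeps the result

print(strip_tags('<body><div>some</div><div>text</div></body>'))
```

The last lines also show why the loop in the question had no effect: the return value of str.replace has to be assigned to something.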

2014-02-26 14:11:05 ndpu

Thanks, I was looking for a way to do this without having to use some pre-implemented method, because I don't learn anything from those. – user2909869
