Python - Remove markup tags and read HTML from a file?

I have a file named BBC_news_home.html, and I need to remove all the markup tags so that I am left with just the text. So far I have:

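import codecs
import re

def clean_html(html):
    cleaned = ''
    line = html
    # Non-greedy match of anything between '<' and '>'; re.S lets a tag span lines
    pattern = r'(<.*?>)'
    # findall only lists the tags it finds; it does not remove them
    result = re.findall(pattern, line, re.S)
    if result:
        # This prints the raw file again, tags included
        f = codecs.open("BBC_news_home.html", 'r', 'utf-8')
        print(f.read())
    else:
        print('Not cleaned.')
    # cleaned is never updated, so this is always ''
    return cleaned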

I checked the pattern on regex101.com and it is correct. I just don't know how to print the output to check whether the markup tags are all gone.
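For what it's worth, re.findall only lists the matches, so nothing is ever removed. Below is a minimal sketch of one way to strip the matches with re.sub and then check the result, using the same pattern without the capture group. The names strip_tags and has_tags are made up for illustration, and the sample string stands in for the file contents.

```python
import re

# Same pattern as in the question, minus the capture group (re.sub does not need it).
TAG_PATTERN = re.compile(r'<.*?>', re.S)

def strip_tags(html):
    """Return html with every <...> span removed (hypothetical helper)."""
    return TAG_PATTERN.sub('', html)

def has_tags(text):
    """True if anything that looks like a tag is still present (hypothetical helper)."""
    return TAG_PATTERN.search(text) is not None

sample = '<p>Hello <b>world</b></p>'
cleaned = strip_tags(sample)
print(cleaned)            # the text with the tags removed
print(has_tags(cleaned))  # False means no <...> spans remain
```

Note that a regex like this removes only the tags themselves. Text inside script and style elements stays in the result, which is one reason an HTML parser is usually suggested instead.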


2017-10-10 Wub

You might want to look at [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), more specifically [.get_text()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text).

You really should use BeautifulSoup for this. Depending on your Python version, run pip3 install BeautifulSoup4 or pip install BeautifulSoup4. I have posted an answer to a similar question here. For completeness:

from bs4 import BeautifulSoup

def cleanme(html):
    # Create a new bs4 object from the loaded HTML; naming the parser avoids a warning
    soup = BeautifulSoup(html, "html.parser")
    # Drop <script> and <style> elements so their contents are not returned as text
    # ("style" is added here; the snippet as posted listed only "script")
    for element in soup(["script", "style"]):
        element.extract()
    text = soup.get_text()
    return text

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print(cleaned)

The output is just the text, with the tags and the script contents gone: the title, followed by I need this text captured and And this.
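If installing BeautifulSoup is not an option, something similar can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not the approach from the answer above, and it handles malformed HTML less gracefully. The names TextExtractor, html_to_text and file_to_text are made up for illustration. The file is assumed to be UTF-8, as in the question.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def file_to_text(path, encoding="utf-8"):
    """Read the whole file, then hand the string to the extractor."""
    with open(path, "r", encoding=encoding) as f:
        return html_to_text(f.read())


print(html_to_text("<head><title>T</title><script>getit</script></head>"
                   "<body>I need this text captured<h1>And this</h1></body>"))
# For the file from the question: print(file_to_text("BBC_news_home.html"))
```

The key point for the original question is the order of operations: read the file into a string first, then pass that string to the cleaning function and print what it returns.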


2017-10-10 17:59:04 jamescampbell
