Python: get the text value of an HTML document

My question is simple: I have a string that contains HTML tags, and I just want to get the actual text value out of that string. For example:

HTML string:

<strong><p> hello </p><p> world </p></strong>

Text value: hello world

Is there a function that can do this?


2013-08-27 Rachid Oussanaa

You can use BeautifulSoup's get_text() function:

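from bs4 import BeautifulSoup

text = "<strong><p> hello </p><p> world </p></strong>"

# Name the parser explicitly; recent bs4 versions warn when it is omitted.
soup = BeautifulSoup(text, "html.parser")
print(soup.get_text())  # prints " hello  world " (tags removed, whitespace kept)
# soup.get_text(" ", strip=True) returns "hello world" instead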

Alternatively, you can use nltk:

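import nltk

text = "<strong><p> hello </p><p> world </p></strong>"

# Works on NLTK 2.x only; NLTK 3 removed clean_html() and raises
# NotImplementedError, pointing to BeautifulSoup's get_text() instead.
print(nltk.clean_html(text))  # prints "hello world"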

Another option is html2text, but it behaves a bit differently: for example, strong is replaced with *.

See also the related thread: Extracting text from HTML file using Python

Hope that helps.
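
If installing a third-party package is not an option, the standard library's html.parser module can do a similar job. The following is only a minimal sketch, assuming Python 3; the names TextExtractor and html_to_text are made up for this illustration and are not part of any library:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True in Python 3.5+
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces, then collapse runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())


print(html_to_text("<strong><p> hello </p><p> world </p></strong>"))  # hello world
```

Note that this sketch also keeps the contents of script and style elements, so it is probably only suitable for simple fragments like the one in the question.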


2013-08-27 19:00:05 alecxe

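Thanks, the BeautifulSoup function works well, but there is one problem: when I try to print the resulting text, it gives me this error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 47: ordinal not in range(128). PS: my text contains accented characters.

Never mind, I found a solution here that works for French text: http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20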
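
For readers who hit the same error: it appears when a string containing a non-ASCII character such as \xe9 is encoded with the ASCII codec, which Python 2 may do implicitly when printing. A minimal sketch of the failure and of one common fix, encoding explicitly as UTF-8; the sample string is made up for illustration and the snippet assumes Python 3 syntax:

```python
text = u"caf\xe9"  # sample text; \xe9 is the accented character from the error

try:
    text.encode("ascii")  # the ASCII codec cannot represent \xe9
except UnicodeEncodeError as exc:
    print("ascii failed:", exc.reason)

encoded = text.encode("utf-8")  # UTF-8 can represent any Unicode character
print(encoded)
```

In Python 2 the equivalent would be to print soup.get_text().encode("utf-8") rather than the unicode object itself.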
