Using BeautifulSoup to get clean text from a text/html document

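I have a document with two content types: text/xml and text/html. I want to parse the document with BeautifulSoup and end up with a clean text version. The document starts out as a tuple, so I have been using repr to turn it into something BeautifulSoup recognizes, then using find_all to locate the text/html part of the document by searching for divs, like so: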

soup = BeautifulSoup(repr(msg_data)) 
text = soup.html.find_all("div")

Then I convert the result back to a string, save it to a variable, put it back into a soup object, and call get_text on it, like so:

str_text = str(text) 
soup_text = BeautifulSoup(str_text) 
soup_text.get_text()

However, that changes the encoding to unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17  
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]'

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8')

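I am back to the unparsed type.

I want to end up with the clean text saved as a string, so that I can then find specific things in it (for example, "puppies" in the text above).

Basically, I'm running around in circles here. Can anyone help? As always, thanks very much for any help you can give.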

Source

2012-03-18 spikem

The encoding isn't broken; it is exactly what it should be. '\xa0' is the Unicode character for a non-breaking space.
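To check this for yourself, here is a minimal sketch (Python 3 syntax, standard library only). The sample string is a shortened piece of the output in the question, and the variable names are only for illustration. It looks up the character's name, then swaps it for an ordinary space so the text can be searched:

```python
import unicodedata

# Shortened sample of the text from the question (illustrative only).
text = u'\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out'

# U+00A0 is a real Unicode character, not encoding damage.
nbsp_name = unicodedata.name(u'\xa0')
print(nbsp_name)

# Replace each non-breaking space with a regular space and trim the ends.
clean = text.replace(u'\xa0', u' ').strip()
print(clean)

# The cleaned value is an ordinary string, so substring search works.
has_puppies = 'puppies' in clean
print(has_puppies)
```

Whether you want a plain space, or want the character removed entirely, depends on what you plan to do with the text afterwards.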

If you want this (Unicode) string encoded as ASCII, you can tell the codec to ignore any characters it doesn't understand:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]' 
>>> x.encode('ascii', 'ignore') 
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do, 9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while browsing their site, me: srsly, Erica: unless of course your writing is magic, me: My writing saves drowning puppies, Just plucks him right out and gives them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]'
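Note that 'ignore' drops the non-breaking spaces entirely, which is why "PM" and "Erica" run together in the result above. If you would rather keep the word boundaries, one option is to replace the character with a plain space before encoding. A sketch under that assumption (Python 3, where encode returns bytes rather than str; the sample string is shortened from the question):

```python
# Shortened sample of the text from the question (illustrative only).
x = u'9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long'

# 'ignore' silently deletes anything ASCII cannot represent.
dropped = x.encode('ascii', 'ignore')
print(dropped)

# Swapping U+00A0 for a space first keeps the words separated.
spaced = x.replace(u'\xa0', u' ').encode('ascii')
print(spaced)
```

If you only need to search the text, you may not need to encode at all; the replace step on the Unicode string is enough.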

If you have time, you should watch Ned Batchelder's recent video, Pragmatic Unicode. It will make everything clear!

Source

2012-03-18 19:59:05 katrielalex

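Yes, it occurred to me just as I posted that "ruined" was a bit strong; editing it now. Thanks for the video, I'll take a look. Do you have any text resources I could read through (I know these are just a Google search away, but is there one you particularly like)? – spikem 2012-03-18 19:59:40

@spikem What were you expecting? You have a string with non-ASCII characters (the non-breaking spaces). You can't just magic them away. – katrielalex 2012-03-18 20:01:55

I don't think I asked or expected them to be magicked away. I'm just not fully familiar with unicode, which is why I came here to ask. – spikem 2012-03-18 20:05:18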
