2012-03-18 149 views
1

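Using BeautifulSoup to get clean text from a text/html document

I have a document with two content types: text/xml and text/html. I'd like to use BeautifulSoup to parse the document and end up with a clean text version. The document starts out as a tuple, so I've been using repr to turn it into something BeautifulSoup recognizes, and then using find_all to find just the text/html part of the document by searching for divs, like so: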

from bs4 import BeautifulSoup

soup = BeautifulSoup(repr(msg_data))
text = soup.html.find_all("div")

Then I convert the result back to a string, save it to a variable, put it back into a soup object, and call get_text on it, like so:

str_text = str(text) 
soup_text = BeautifulSoup(str_text) 
soup_text.get_text() 

However, this then changes the encoding to Unicode, like so:

u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17  
PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 
9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while 
browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, 
\xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives 
them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]' 

When I try to re-encode it as UTF-8, like so:

soup.encode('utf-8') 

I am back to the unparsed type.

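I'd like to get to the point where I have the clean text saved as a string, so that I can then find specific things in it (for example, "puppies" in the text above).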

Basically, I'm running around in circles here. Can anyone help? As always, thanks very much for any help you can give.

Answers

2

The encoding isn't broken; it is exactly what it should be. '\xa0' is the Unicode code point for a non-breaking space.
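
A quick way to check this is to ask the standard library what the character is. This is only a sketch, assuming Python 3 (the session further down is Python 2) and a shortened sample string rather than the full transcript:

```python
import unicodedata

# Shortened sample for illustration, not the full transcript from the question.
sample = u'9:16 PM\xa0Erica: with images'

nbsp = sample[7]                    # the character right after "9:16 PM"
name = unicodedata.name(nbsp)       # official Unicode character name
code_point = hex(ord(nbsp))         # numeric code point as a hex string
utf8_bytes = nbsp.encode('utf-8')   # how the character is stored in UTF-8

print(name, code_point, utf8_bytes)
```

So the character is real data in the text, not damage, and it has a perfectly valid UTF-8 encoding.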

If you want this (Unicode) string encoded as ASCII, you can tell the codec to ignore any characters it doesn't understand:

>>> x = u'[9:16 PM\xa0Erica: with images, \xa0\xa0and that seemed long to me anyway, 9:17 PM\xa0me: yeah, \xa0Erica: so feel free to make it shorter, \xa0\xa0or rather, please do, 9:18 PM\xa0nobody wants to read about that shit for 2 pages, \xa0me: :), \xa0Erica: while browsing their site, \xa0me: srsly, \xa0Erica: unless of course your writing is magic, \xa0me: My writing saves drowning puppies, \xa0\xa0Just plucks him right out and gives them a scratch behind the ears and some kibble, \xa0Erica: Maine is weird, \xa0me: haha]' 
>>> x.encode('ascii', 'ignore') 
'[9:16 PMErica: with images, and that seemed long to me anyway, 9:17 PMme: yeah, Erica: so feel free to make it shorter, or rather, please do, 9:18 PMnobody wants to read about that shit for 2 pages, me: :), Erica: while browsing their site, me: srsly, Erica: unless of course your writing is magic, me: My writing saves drowning puppies, Just plucks him right out and gives them a scratch behind the ears and some kibble, Erica: Maine is weird, me: haha]' 
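
Note that 'ignore' drops each non-breaking space outright, which is why words run together in the output above ("PMErica"). If that matters, one option is to swap the non-breaking spaces for ordinary spaces first. The text can also be searched as Unicode without re-encoding at all. A minimal sketch, assuming Python 3 and a shortened sample string; `clean_text` is a hypothetical helper of my own, not part of BeautifulSoup:

```python
def clean_text(text):
    """Replace non-breaking spaces with plain spaces and collapse whitespace runs."""
    # Turn each U+00A0 (NO-BREAK SPACE) into an ordinary ASCII space.
    replaced = text.replace(u'\xa0', u' ')
    # split() with no arguments splits on any whitespace run and drops empty pieces.
    return u' '.join(replaced.split())

# Shortened sample for illustration, not the full transcript from the question.
sample = (u'9:16 PM\xa0Erica: with images, \xa0me: My writing saves drowning '
          u'puppies, \xa0\xa0Just plucks him right out')

cleaned = clean_text(sample)
ascii_bytes = cleaned.encode('ascii', 'ignore')  # nothing left to drop, so the spaces survive
found = u'puppies' in cleaned                    # searching works on the Unicode string itself

print(cleaned)
print(found)
```

In the question's code this would presumably be applied to the string returned by get_text(). As far as I can tell, calling encode on the soup object itself serializes the whole parsed document back to markup, which would explain getting the unparsed content back.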

If you have time, you should watch Ned Batchelder's recent video, Pragmatic Unicode. It will make everything clear!

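Yes, it occurred to me just as I posted that "ruined" was a bit strong; editing it now. Thanks for the video, I'll take a look. Do you have any text resources I could read through? (I know these are just a Google search away, but are there any you particularly like?) – spikem 2012-03-18 19:59:40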

@spikem What were you expecting? You have a string with non-ASCII characters (the non-breaking spaces). You can't just magic them away. – katrielalex 2012-03-18 20:01:55

I don't think I asked for, or expected, them to be magicked away. I'm just not fully familiar with Unicode, which is why I came here to ask. – spikem 2012-03-18 20:05:18
