2012-11-15 150 views
1

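Parsing HTML to plain text

I'm trying to use the MLStripper class, which I found recommended in several posts, to strip the HTML out of emails and get plain text. The strip_tags function runs into trouble on the "@" symbol when it tries to parse. My guess is that the class isn't robust enough and can only parse valid HTML tags. Any suggestions on how to fix the code below so it handles "@", or on another library for removing HTML from text? I also need to remove things such as &.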

Python:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

    def strip_tags(self, html):
        s = MLStripper()
        s.feed(html)
        return s.get_data()

ML = MLStripper()
test = ML.strip_tags("<div><br>On Sep 27, 2012, at 4:11 PM, Mark Smith <[email protected]> wrote</br></div>")
print test

Error:

Traceback (most recent call last):
  File "IMAPReader.py", line 49, in <module>
    strippedText = ML.strip_tags("<[email protected]>")
  File "IMAPReader.py", line 22, in strip_tags
    s.feed(html)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 9

Answers

2

If you can't count on receiving valid markup, you don't want a strict HTML parser. Check out BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

Their documentation happens to have a good example of exactly what you want to do:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

from bs4 import BeautifulSoup 

html_doc = """ 
<html><head><title>The Dormouse's story</title></head> 

<p class="title"><b>The Dormouse's story</b></p> 

<p class="story">Once upon a time there were three little sisters; and their names were 
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 

<p class="story">...</p> 
""" 

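# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, "html.parser")

# get_text() returns only the text content, with all tags removed
print(soup.get_text())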

This prints:

# The Dormouse's story 
# 
# The Dormouse's story 
# 
# Once upon a time there were three little sisters; and their names were 
# Elsie, 
# Lacie and 
# Tillie; 
# and they lived at the bottom of a well. 
# 
# ... 
0

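Which version of Python are you using? I ran your code with Python 2.7.2 and got the same error you did. Then I ran it on another computer with Python 2.7.3, and it worked perfectly. That seemed odd, so I looked into it and found some documentation saying that the HTML parser included with newer Python versions has become more lenient. Try upgrading to 2.7.3 and it should work.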
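
For anyone reading this on Python 3, here is a rough sketch of the same idea using the standard library's html.parser module, which is lenient about malformed markup. The names TextExtractor and strip_tags and the address mark@example.com are placeholders of mine, not something from the question, and the code assumes Python 3.5 or later.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of whatever markup is fed in."""

    def __init__(self):
        # convert_charrefs=True turns references such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves never reach this method
        self.chunks.append(data)

    def get_text(self):
        return "".join(self.chunks)


def strip_tags(html):
    """Return the text content of html with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return parser.get_text()


# Hypothetical sample input; the address is a placeholder
sample = ("<div><br>On Sep 27, 2012, at 4:11 PM, "
          "Mark Smith <mark@example.com> wrote</br></div>")
print(strip_tags(sample))
print(strip_tags("<p>Tom &amp; Jerry</p>"))
```

One caveat: an address written in angle brackets looks like a tag to the parser, so it will most likely be dropped from the output rather than kept as text. If you need to keep it, you would have to escape or extract it before parsing.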