Strip HTML tags and return only the text, using mechanize in Python

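I wrote some code that uses mechanize to submit a search term and pull information from a website. The result comes back with HTML tags mixed in with the text and other details, and I need to extract only the text. Please help me modify the code so that it strips the HTML tags and returns only the text.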

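import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.addheaders = [('User-agent', 'Firefox')]  # send a browser-like User-Agent
r = br.open("http://www.drugs.com/search-wildcard-phonetic.html")
br.select_form(nr=0)  # select the first form on the page
br.form['searchterm'] = 'panadol'
br.submit()
print(br.response().read())  # raw HTML of the result page, tags included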


2014-02-23 FathimaBeevi

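Are you looking for some specific text inside a particular tag? – SpencerD

@SpencerGrantDoak Yes – FathimaBeevi

I would strongly recommend regular expressions. I have not used mechanize, but I assume 'br.response().read()' returns a string. If so, you can import the regular expression module and pull the data out of the HTML tags. – SpencerD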
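
A minimal sketch of what the regex suggestion might look like, assuming the page is already available as a string. The pattern and the name `strip_tags_regex` are illustrative and do not come from the comment. Regular expressions are fragile on real HTML (script contents, a `>` inside an attribute value), so this is a rough approach rather than a robust one.

```python
import re

# Hypothetical helper: the name and the pattern are assumptions for illustration.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags_regex(html):
    """Replace anything that looks like a tag with a space, then collapse whitespace."""
    text = TAG_RE.sub(" ", html)
    return " ".join(text.split())

sample = "<div><h1>Panadol</h1><p>Generic name: <b>acetaminophen</b></p></div>"
print(strip_tags_regex(sample))
```

Entities such as `&amp;` are left untouched by this version, and text inside `<script>` or `<style>` is kept, which is one reason a parser-based approach tends to be safer.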

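This appears to be a duplicate of Python code to remove HTML tags from a string, which in turn points to Strip HTML from strings in Python.

Code for the same problem

Copied from the answer given to that question:

I have always used this function to strip HTML tags, as it requires only the Python stdlib: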

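try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MLStripper(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # also calls reset()
        self.fed = []

    def handle_data(self, d):
        # called for each run of text between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()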
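
One way this could be wired to the mechanize response from the question. The sketch is self-contained and written for Python 3's `html.parser`, so it restates a small stripper instead of reusing the one above. It adds one assumption of its own: text inside `<script>` and `<style>` is skipped, which the quoted answer does not do. `TextOnly` and `html_to_text` are illustrative names, and UTF-8 is assumed when the input is bytes.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    if isinstance(html, bytes):
        # assumption: the page is UTF-8; undecodable bytes are replaced
        html = html.decode("utf-8", errors="replace")
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Panadol</h1><p>Pain &amp; fever relief</p></body></html>")
print(html_to_text(page))
```

With the browser object from the question, the last line of that script would presumably become `print(html_to_text(br.response().read()))`.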


2014-02-23 18:39:26 sabbahillel
