Ignoring content with Beautiful Soup

https://en.wikipedia.org/wiki/America

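I need to grab the content of the h2, h3 and p tags. However, I want to ignore the headings and content of these sections:

"See also"
"Notes"
"References"
Ignore all tables/URLs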

How would I do this in Beautiful Soup? My current code is below:

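from bs4 import BeautifulSoup

# directory_of_raw_documents is not shown in the question; it is the path
# prefix of the saved HTML pages.

def open_document():
    for i in range(1, 1 + 1):
        with open(directory_of_raw_documents + str(i), "r") as document:
            html = document.read()
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find('div', id='bodyContent')
        results = ""
        # Collect the text of every heading and paragraph in the main content.
        for item in body.find_all(['h2', 'h3', 'p']):
            results += item.get_text() + "\n"
        # Strip the "[edit]" link text that follows each heading.
        results = results.replace("[edit]", "")
        print(results)

open_document()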

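My desired output would not contain anything from tables or from the See also, Notes or References sections. I would rather not use the Wikipedia module in Python 2.7.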

Source

2016-11-04 Jorge

soup.find(something)

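means that you search the whole document for something. If you want to ignore some of the content, you need to narrow the scope. In your case, you can use:

soup.find(id='bodyContent')  # this narrows the scope to the main content

Then you can use find_all: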

soup.find(id='bodyContent').find_all(name=['h2', 'h3', 'p'], href=False)
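As far as I can tell, href=False only excludes tags that carry an href attribute themselves, which h2, h3 and p tags normally do not, so on its own it will not drop the unwanted sections or the tables. Here is a minimal sketch of one way to do that filtering on top of the narrowed scope. The names extract_text and SKIP_SECTIONS are made up for illustration, and the sketch assumes each unwanted section starts with an h2 whose text (after removing "[edit]") is exactly one of the listed titles:

```python
from bs4 import BeautifulSoup

# Assumption: the unwanted sections are introduced by h2 headings with these texts.
SKIP_SECTIONS = {"See also", "Notes", "References"}


def extract_text(html):
    """Return h2/h3/p text from #bodyContent, minus tables and skipped sections."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find(id="bodyContent")

    lines = []
    skipping = False
    for item in body.find_all(["h2", "h3", "p"]):
        # Ignore anything that sits inside a table.
        if item.find_parent("table") is not None:
            continue
        text = item.get_text().replace("[edit]", "").strip()
        if item.name == "h2":
            # Each h2 opens a new top-level section; decide whether to skip it.
            skipping = text in SKIP_SECTIONS
        if not skipping and text:
            lines.append(text)
    return "\n".join(lines)
```

With the html string read from the file as in the question, print(extract_text(html)) would replace the loop that builds results. Everything after a skipped h2 is dropped until the next h2, so the h3 subsections of "See also", "Notes" and "References" go away with them.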

Source

2016-11-17 05:05:24
