Paragraphs between two tags in BeautifulSoup4

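I'm fairly new to programming, Python and BS4, and I'm hoping to get better through a web-scraping project. I have a bunch of similar pieces of information that I want to scrape separately. Here is the template I need to work with: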

<h3>Synopsis</h3> 
<p>First part of synopsis</p> 
<p>Second part of paragraph</p> 
<p>Third part of paragraph</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer<br> 
<strong>Directed By:</strong> Director</p> 
<h4>Cast</h4> 
<p>List of the cast in one line</p>

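The "Directed By" and "Written By" information is easy to collect, but I'd like to have the synopsis and cast paragraphs as well. The problem is that the synopsis on the site isn't always three paragraphs (sometimes fewer, sometimes more), so I can't hard-code it. My idea is to use the word "Synopsis" in the text as the starting point and collect all the content from there, but I don't know how to implement this. I tried working with regular expressions, but I don't know them well, and I don't know how to use HTML tags in a regular expression.

Any help would be greatly appreciated.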

Source

2017-07-30 Zoltán Buka

Do you want everything shown in the grey box?

from bs4 import BeautifulSoup 

text = """<h3>Synopsis</h3> 
<p>First part of synopsis</p> 
<p>Second part of paragraph</p> 
<p>Third part of paragraph</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer<br> 
<strong>Directed By:</strong> Director</p> 
<h4>Cast</h4> 
<p>List of the cast in one line</p>""" 

soup = BeautifulSoup(text, "html.parser") 

synopsis = ''
for para in soup.find_all("p"):
    # Stop at the writer/director paragraph; everything before it is synopsis.
    if para.get('class') == ['writerDirector']:
        break
    synopsis += para.text + '\n'

print(synopsis)

Output:

First part of synopsis 
Second part of paragraph 
Third part of paragraph

Getting the cast requires a bit of hard-coding:

cast_text = text[text.index('<h4>Cast</h4>'):] 

soup = BeautifulSoup(cast_text, "html.parser") 

cast_members = '' 
for para in soup.find_all('p'): 
    cast_members += para.text + '\n' 

print(cast_members)

Output:

List of the cast in one line
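
As an alternative to slicing the raw HTML string, the cast paragraph can also be reached by navigating from the heading. The following is a minimal sketch, assuming the `<h4>` contains exactly the text "Cast" and that the cast paragraph is its next `<p>` sibling; the shortened `html` sample and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its text, then take the first <p> that follows it
# at the same nesting level. Both lookups return None when nothing matches.
cast_heading = soup.find("h4", string="Cast")
cast_paragraph = cast_heading.find_next_sibling("p") if cast_heading else None
cast_members = cast_paragraph.get_text(strip=True) if cast_paragraph else ""

print(cast_members)
```

This avoids `text.index(...)` raising `ValueError` on pages that have no cast heading, at the cost of depending on the sibling layout shown in the question.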

Source

2017-07-30 16:27:48

Thank you very much for your answer, this is exactly what I needed, and your code is really easy to understand.

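This might capture the gist of a technique for doing what you need.

You know that the content you want starts with the h3 element. From there you walk through its next_siblings. Siblings such as blank lines ('\n') have a sibling.name of None, so we can safely ignore them. This code displays sibling.name and the complete sibling for each sibling of the h3 element. You have already indicated that you know how to dig into that content.

Now all you have to do is write code that notices the h4 element for 'Cast', so that it can arrange to read one more p element for the players in the cast.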

>>> HTML = '''\
... <h3>Synopsis</h3> 
... <p>First part of synopsis</p> 
... <p>Second part of paragraph</p> 
... <p>Third part of paragraph</p> 
... <p class="writerDirector"><strong>Written By:</strong> Writer<br> 
... <strong>Directed By:</strong> Director</p> 
... <h4>Cast</h4> 
... <p>List of the cast in one line</p> 
... ''' 
>>> import bs4 
>>> soup = bs4.BeautifulSoup(HTML, 'lxml') 
>>> h3 = soup.find('h3') 
>>> for sibling in h3.next_siblings: 
...  if sibling.name: 
...   sibling.name 
...   sibling 
...   
'p' 
<p>First part of synopsis</p> 
'p' 
<p>Second part of paragraph</p> 
'p' 
<p>Third part of paragraph</p> 
'p' 
<p class="writerDirector"><strong>Written By:</strong> Writer<br/> 
<strong>Directed By:</strong> Director</p> 
'h4' 
<h4>Cast</h4> 
'p' 
<p>List of the cast in one line</p>
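
The sibling walk above can be turned into a small collector. This is only a sketch under the assumptions of the sample markup: synopsis paragraphs are plain `<p>` elements, the writer/director paragraph carries the class `writerDirector`, and the cast is the first `<p>` after an `<h4>` reading "Cast". The helper name `parse_entry` is hypothetical.

```python
from bs4 import BeautifulSoup

HTML = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
"""


def parse_entry(h3):
    """Hypothetical helper: return (synopsis paragraphs, cast line)."""
    synopsis, cast = [], None
    expect_cast = False
    for sibling in h3.next_siblings:
        if sibling.name is None:
            continue  # plain strings such as '\n' have no tag name
        if sibling.name == "h3":
            break  # the next entry starts here
        if sibling.name == "h4" and sibling.get_text(strip=True) == "Cast":
            expect_cast = True  # the next <p> holds the cast
        elif sibling.name == "p":
            if expect_cast:
                cast = sibling.get_text(strip=True)
                break
            if "writerDirector" not in sibling.get("class", []):
                synopsis.append(sibling.get_text(strip=True))
    return synopsis, cast


soup = BeautifulSoup(HTML, "html.parser")
synopsis, cast = parse_entry(soup.find("h3"))
print(synopsis)
print(cast)
```

Because the loop never counts paragraphs, it should cope with a synopsis of any length, which was the sticking point in the question.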

Source

2017-07-30 16:51:59

Thanks for the tip, it is very useful to understand your way of thinking.

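Assuming you have more than one synopsis on a page (even if you don't), you can iterate over the soup and collect everything between the h3 Synopsis tags: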

from bs4 import BeautifulSoup 

html ="""<html><h3>Synopsis</h3> 
<p>First part of synopsis</p> 
<p>Second part of paragraph</p> 
<p>Third part of paragraph</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer<br> 
<strong>Directed By:</strong> Director</p> 
<h4>Cast</h4> 
<p>List of the cast in one line</p> 
<h3>Synopsis</h3> 
<p>First part of synopsis 2</p> 
<p>Second part of paragraph 2</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer 2<br> 
<strong>Directed By:</strong> Director 2</p> 
<h4>Cast</h4> 
<p>List of the cast in one line 2</p></html>""" 


soup = BeautifulSoup(html, 'lxml') 
value = "" 
start = False 

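# Only visit the headings and paragraphs; iterating over every tag would
# also visit the nested <strong> tags and repeat their text.
for i in soup.find_all(['h3', 'h4', 'p']):
    if i.name == 'h3' and i.string == 'Synopsis':
        if start:
            # A new synopsis begins, so print the block collected so far.
            print(value)
            value = ""
        print("Synopsis")
        start = True
    elif start:
        value = value + " " + i.text
# Print the last collected block.
if value:
    print(value)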
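
If only the synopsis paragraphs are wanted for each entry, and not the writer, director and cast text, one possible variation is to stop collecting at the first sibling that is not a plain paragraph. The sketch below assumes the same markup conventions as the sample (class `writerDirector` marks the end of the synopsis); the predicate name `is_synopsis_paragraph` is made up for illustration.

```python
from itertools import takewhile

from bs4 import BeautifulSoup

html = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
<h3>Synopsis</h3>
<p>First part of synopsis 2</p>
<p class="writerDirector"><strong>Written By:</strong> Writer 2<br>
<strong>Directed By:</strong> Director 2</p>
<h4>Cast</h4>
<p>List of the cast in one line 2</p>"""


def is_synopsis_paragraph(tag):
    # A plain <p> without the writerDirector class counts as synopsis text.
    return tag.name == "p" and "writerDirector" not in tag.get("class", [])


soup = BeautifulSoup(html, "html.parser")

synopses = []
for heading in soup.find_all("h3", string="Synopsis"):
    # find_next_siblings() with no arguments yields only tags, so the
    # whitespace strings between elements are skipped automatically.
    paragraphs = takewhile(is_synopsis_paragraph, heading.find_next_siblings())
    synopses.append([p.get_text(strip=True) for p in paragraphs])

print(synopses)
```

Each inner list holds the paragraphs of one synopsis, so entries with different paragraph counts stay separate.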

Source

2017-07-30 17:00:00
