2017-07-30 82 views
0

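Paragraphs between two tags in BeautifulSoup4

I'm fairly new to programming, Python, and BS4, and I'm hoping to get better through a web-scraping project. I have a bunch of pages with similar information that I want to parse into separate pieces. Here is the template I need to work with: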

<h3>Synopsis</h3> 
<p>First part of synopsis</p> 
<p>Second part of paragraph</p> 
<p>Third part of paragraph</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer<br> 
<strong>Directed By:</strong> Director</p> 
<h4>Cast</h4> 
<p>List of the cast in one line</p> 

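The "Written By" and "Directed By" information is easy to collect, but I would also like to get the synopsis and cast paragraphs. The problem is that the synopsis on the site is not always three paragraphs long (sometimes fewer, sometimes more), so I can't hard-code it. My idea is to use the word "Synopsis" in the text as the starting point, pick a later element as the end point, and collect everything in between, but I don't know how to implement that. I tried working with regular expressions, but I don't know them well, and I don't know how to use HTML tags in a regex.

Any help would be greatly appreciated.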


Do you want everything shown in the gray box?

Answers

1
from bs4 import BeautifulSoup 

text = """<h3>Synopsis</h3> 
<p>First part of synopsis</p> 
<p>Second part of paragraph</p> 
<p>Third part of paragraph</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer<br> 
<strong>Directed By:</strong> Director</p> 
<h4>Cast</h4> 
<p>List of the cast in one line</p>""" 

soup = BeautifulSoup(text, "html.parser") 

synopsis = '' 
for para in soup.find_all("p"):
    if para.get('class') == ['writerDirector']:
        break
    synopsis += para.text + '\n'

print(synopsis) 

Output:

First part of synopsis 
Second part of paragraph 
Third part of paragraph 

Getting the cast requires a bit of hard-coding:

cast_text = text[text.index('<h4>Cast</h4>'):] 

soup = BeautifulSoup(cast_text, "html.parser") 

cast_members = '' 
for para in soup.find_all('p'): 
    cast_members += para.text + '\n' 

print(cast_members) 

Output:

List of the cast in one line 
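
As a possible alternative to slicing the raw HTML string, the cast paragraph can also be reached through the parse tree. This is only a minimal sketch: it assumes the heading text is exactly "Cast" and that the cast `<p>` is a following sibling of that `<h4>`, as in the template from the question.

```python
from bs4 import BeautifulSoup

html = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>"""

soup = BeautifulSoup(html, "html.parser")

# Locate the <h4> whose text is exactly "Cast".
cast_heading = soup.find("h4", string="Cast")

# Take the first <p> that follows it at the same nesting level.
cast_para = cast_heading.find_next_sibling("p") if cast_heading else None
cast_members = cast_para.get_text(strip=True) if cast_para else ""

print(cast_members)
```

If a page has no Cast heading, `cast_members` stays an empty string instead of raising the `ValueError` that `text.index(...)` would.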

Thank you very much for your answer. This is exactly what I needed, and your code is really easy to understand.

0

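This might capture the essentials of a technique that does what you need.

You know that the content you want begins with the H3 element. From there you walk through its next_siblings. For siblings such as blank lines ('\n'), sibling.name is None, so we can safely ignore them. The code below displays sibling.name and the complete sibling for each sibling of the H3 element. You have already shown that you know how to dig into these contents.

Now all you have to do is write code that notices the h4 element for 'Cast', so that it can arrange to read one more p element for the cast members.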

>>> HTML = '''\ 
... <h3>Synopsis</h3> 
... <p>First part of synopsis</p> 
... <p>Second part of paragraph</p> 
... <p>Third part of paragraph</p> 
... <p class="writerDirector"><strong>Written By:</strong> Writer<br> 
... <strong>Directed By:</strong> Director</p> 
... <h4>Cast</h4> 
... <p>List of the cast in one line</p> 
... ''' 
>>> import bs4 
>>> soup = bs4.BeautifulSoup(HTML, 'lxml') 
>>> h3 = soup.find('h3') 
>>> for sibling in h3.next_siblings: 
...  if sibling.name: 
...   sibling.name 
...   sibling 
...   
'p' 
<p>First part of synopsis</p> 
'p' 
<p>Second part of paragraph</p> 
'p' 
<p>Third part of paragraph</p> 
'p' 
<p class="writerDirector"><strong>Written By:</strong> Writer<br/> 
<strong>Directed By:</strong> Director</p> 
'h4' 
<h4>Cast</h4> 
'p' 
<p>List of the cast in one line</p> 
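
The sibling walk can be turned into a small function along these lines. This is a sketch, not a definitive implementation: `parse_entry` is a made-up name, and it assumes the cast is the first `<p>` after the `<h4>Cast</h4>` heading. It uses `isinstance(sibling, Tag)` to skip the bare strings between tags, which serves the same purpose as the `sibling.name` check in the session.

```python
from bs4 import BeautifulSoup, Tag

HTML = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p>Third part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
"""


def parse_entry(h3):
    """Return (synopsis paragraphs, cast line) for the siblings after one <h3>."""
    synopsis, cast = [], None
    in_cast = False
    for sibling in h3.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip bare strings such as "\n"
        if sibling.name == "h3":
            break  # the next entry starts here
        if sibling.name == "h4" and sibling.get_text(strip=True) == "Cast":
            in_cast = True  # the next <p> holds the cast
        elif sibling.name == "p":
            if in_cast:
                cast = sibling.get_text(strip=True)
                break
            # Everything before the writer/director paragraph is synopsis.
            if "writerDirector" not in sibling.get("class", []):
                synopsis.append(sibling.get_text(strip=True))
    return synopsis, cast


soup = BeautifulSoup(HTML, "html.parser")
synopsis, cast = parse_entry(soup.find("h3"))
print(synopsis)
print(cast)
```

Because the loop keys on tag names rather than on a paragraph count, it handles a synopsis of any length.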

Thanks for the tip. Seeing how you think about the problem is very useful.

0

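Assuming you have more than one synopsis on the page (and even if you don't), you can iterate over the soup and collect everything between the H3 Synopsis tags: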

from bs4 import BeautifulSoup 

html ="""<html><h3>Synopsis</h3> 
<p>First part of synopsis</p> 
<p>Second part of paragraph</p> 
<p>Third part of paragraph</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer<br> 
<strong>Directed By:</strong> Director</p> 
<h4>Cast</h4> 
<p>List of the cast in one line</p> 
<h3>Synopsis</h3> 
<p>First part of synopsis 2</p> 
<p>Second part of paragraph 2</p> 
<p class="writerDirector"><strong>Written By:</strong> Writer 2<br> 
<strong>Directed By:</strong> Director 2</p> 
<h4>Cast</h4> 
<p>List of the cast in one line 2</p></html>""" 


soup = BeautifulSoup(html, 'lxml') 
value = "" 
start = False 

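# Restrict the scan to the block-level tags of interest; a bare find_all()
# also yields nested tags such as <strong>, whose text would be added twice.
for tag in soup.find_all(['h3', 'h4', 'p']):
    if tag.name == 'h3' and tag.string == 'Synopsis':
        if start:
            print(value)  # flush the previous entry
            value = ""
        print("Synopsis")
        start = True
    elif start:
        value = value + " " + tag.text
# Print whatever was collected for the last entry.
if value:
    print(value)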
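
If you would rather keep the pieces than print them, the same idea can collect one list of text blocks per Synopsis heading. This is a rough sketch under a few assumptions: `split_entries` is a hypothetical helper name, every entry starts with an `<h3>` whose text is exactly "Synopsis", and all the elements of an entry are siblings of that heading.

```python
from bs4 import BeautifulSoup

html = """<h3>Synopsis</h3>
<p>First part of synopsis</p>
<p>Second part of paragraph</p>
<p class="writerDirector"><strong>Written By:</strong> Writer<br>
<strong>Directed By:</strong> Director</p>
<h4>Cast</h4>
<p>List of the cast in one line</p>
<h3>Synopsis</h3>
<p>First part of synopsis 2</p>
<h4>Cast</h4>
<p>List of the cast in one line 2</p>"""


def split_entries(soup):
    """Return one list of text blocks for each <h3>Synopsis</h3> heading."""
    entries = []
    for h3 in soup.find_all("h3", string="Synopsis"):
        texts = []
        # find_next_siblings() with no filter yields only tags, not bare strings.
        for tag in h3.find_next_siblings():
            if tag.name == "h3":
                break  # reached the next entry
            texts.append(tag.get_text(" ", strip=True))
        entries.append(texts)
    return entries


soup = BeautifulSoup(html, "html.parser")
entries = split_entries(soup)
for number, texts in enumerate(entries, start=1):
    print(number, texts)
```

Each inner list still contains the writer/director line and the "Cast" heading text, so further splitting can be done on the list rather than on the HTML.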