Get the text between two h2 headers using BeautifulSoup

I want to scrape the text that comes after Description and before Next Header.

I know that:

In [8]: soup.findAll('h2')[6] 
Out[8]: <h2>Description</h2>

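However, I don't know how to grab the actual text. The problem is that I have to do this for multiple links. Some wrap the text in p tags: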

    <h2>Description</h2>
    <p>This is the text I want </p>
    <p>This is the text I want</p>
    <h2>Next header</h2>

But some do not:

    <h2>Description</h2>
    This is the text I want
    <h2>Next header</h2>

Also, on the pages that do use p tags, I can't just do soup.findAll('p')[22], because on some of them the 'p' I need is at index 21 or 20.


2017-03-15 user6754289

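Walk through the siblings that follow the header. Use isinstance with NavigableString to check whether the next sibling is a text node, or with Tag to check whether it is an element.

If the next sibling is a header, break out of the loop.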

from bs4 import BeautifulSoup, NavigableString, Tag

example = """<h2>Description</h2><p>This is the text I want </p><p>This is the text I want</p><h2>Next header</h2>"""

soup = BeautifulSoup(example, 'html.parser')
for header in soup.find_all('h2'):
    next_node = header
    while True:
        # Step to the sibling that follows the current node.
        next_node = next_node.next_sibling
        if next_node is None:
            break  # no more siblings after this header
        if isinstance(next_node, NavigableString):
            # Bare text that is not wrapped in a tag.
            print(next_node.strip())
        if isinstance(next_node, Tag):
            if next_node.name == "h2":
                break  # reached the next header
            print(next_node.get_text(strip=True))


2017-03-15 22:10:45 Zroq

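This works, but it grabs all of the text, when I only need what is between the two headers. I'll try modifying what you gave me and see if it works, thanks! – user6754289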
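One possible way to limit the output to a single section is to look up the starting header by its text rather than looping over every h2, and then walk its siblings until the next h2. The sketch below assumes the header text is exactly "Description" and that the headers and the wanted text are siblings under the same parent. `text_between_headers` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_between_headers(soup, start_text, header_name="h2"):
    """Collect text after the header matching start_text, up to the next header.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # Find the starting header by its text instead of by position in the page.
    start = soup.find(
        lambda tag: tag.name == header_name
        and tag.get_text(strip=True) == start_text
    )
    if start is None:
        return []

    pieces = []
    for node in start.next_siblings:
        if isinstance(node, Tag):
            if node.name == header_name:
                break  # reached the next header, stop collecting
            text = node.get_text(strip=True)
        elif isinstance(node, NavigableString):
            text = node.strip()  # bare text not wrapped in a tag
        else:
            continue
        if text:  # skip whitespace-only nodes
            pieces.append(text)
    return pieces


# Sample markup (an assumption) with extra sections before and after.
html = """
<h2>Intro</h2><p>Skip me</p>
<h2>Description</h2>
<p>This is the text I want </p>
<p>This is the text I want</p>
<h2>Next header</h2><p>Skip me too</p>
"""
soup = BeautifulSoup(html, "html.parser")
result = text_between_headers(soup, "Description")
print(result)
```

Because the header is matched by its text, this should not depend on whether the wanted paragraph is the 20th, 21st or 22nd p on the page, and it handles both the p-wrapped and the bare-text layouts from the question.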
