Python/BeautifulSoup - Extract div content by checking the h1 text

I have an HTML page that looks like this:

<div class="class1"> 
    <div class="head"> 
     <h1 class="title">Title 1</h1> 
    <div class="body"> 
<!-- some body content --> 
    </div> 
    </div> 
</div> 

<div class="class1"> 
    <div class="head"> 
     <h1 class="title">Title 2</h1> 
    <div class="body"> 
<!-- some body content --> 
    </div> 
    </div> 
</div>

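I need to extract the content of the div with class body, but only for the block whose title equals "Title 2". The parent containers have no specific ID or class, so the h1 text is the only way to tell the divs apart. Currently I use this code: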

from bs4 import BeautifulSoup

# code to open the webpage
soup = BeautifulSoup(data, 'lxml')
body_content = soup.find_all('div', {'class': 'class1'})[1]

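However, this is not very elegant, because it assumes the div I'm interested in is always the second one on the page; it never checks the title.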


2016-11-30 Hyperion

Well, the only way I can think of is something like this:

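from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
result_tags = soup.find_all(name='div', class_='class1')
# keep the first container whose rendered markup mentions the title
body_content = [tag for tag in result_tags if 'Title 2' in tag.prettify()][0]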

This is better than your original code because it doesn't assume that your target div is the second one on the page.
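One caveat is that a substring test on the rendered markup would also match "Title 20", or a body that merely mentions the title. A minimal sketch of a stricter variant follows, comparing the h1 text exactly and then returning the nested div.body. The helper name body_for_title and the placeholder body text are assumptions for illustration, not part of the original page:

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the question; the body text is a made-up placeholder.
html = """
<div class="class1">
  <div class="head">
    <h1 class="title">Title 1</h1>
    <div class="body">first body</div>
  </div>
</div>
<div class="class1">
  <div class="head">
    <h1 class="title">Title 2</h1>
    <div class="body">second body</div>
  </div>
</div>
"""


def body_for_title(soup, title):
    """Return the div.body of the div.class1 whose h1 text equals title, else None."""
    for container in soup.find_all("div", class_="class1"):
        heading = container.find("h1", class_="title")
        # Compare the stripped heading text exactly, not as a substring.
        if heading is not None and heading.get_text(strip=True) == title:
            return container.find("div", class_="body")
    return None


soup = BeautifulSoup(html, "html.parser")
body = body_for_title(soup, "Title 2")
print(body.get_text(strip=True))
```

Returning None when nothing matches avoids the IndexError that the [0] lookup raises if the title is missing.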


2016-11-30 09:04:28 Acepcs

html = '''<div class="class1"> 
    <div class="head"> 
     <h1 class="title">Title 1</h1> 
    <div class="body"> 
<!-- some body content --> 
    </div> 
    </div> 
</div> 

<div class="class1"> 
    <div class="head"> 
     <h1 class="title">Title 2</h1> 
    <div class="body"> 
<!-- some body content --> 
    </div> 
    </div> 
</div>''' 

from bs4 import BeautifulSoup 
soup = BeautifulSoup(html, 'lxml') 
soup.find(lambda tag: tag.get('class')==['class1'] and 'Title 2' in tag.text)

Or:

def T2_tag(tag): 
    return tag.get('class')==['class1'] and 'Title 2' in tag.text 
soup.find(T2_tag)
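Both versions return the outer div.class1 container rather than the div.body the question asks for. A possible way to get the body directly, sketched here under the assumption that the body div follows the h1 as a sibling (as in the sample markup), is to locate the heading by its exact string and step to the next sibling. The placeholder body text is invented for the example:

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the question; the body text is a made-up placeholder.
html = """
<div class="class1">
  <div class="head">
    <h1 class="title">Title 1</h1>
    <div class="body">first body</div>
  </div>
</div>
<div class="class1">
  <div class="head">
    <h1 class="title">Title 2</h1>
    <div class="body">second body</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# string= matches an h1 whose only child is exactly this text.
heading = soup.find("h1", string="Title 2")

# The body div is a later sibling of the h1 inside div.head.
body = heading.find_next_sibling("div", class_="body") if heading else None

if body is not None:
    print(body.get_text(strip=True))
```

The string= match is exact, so a heading with surrounding whitespace or nested tags would need a different test, for example a function or a compiled regular expression.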


2016-11-30 10:46:55
