BeautifulSoup nested class selector

I am using BeautifulSoup for a project. Here is my HTML structure:

<div class="container"> 
<div class="fruits"> 
    <div class="apple"> 
     <p>John</p> 
     <p>Sam</p> 
     <p>Bailey</p> 
     <p>Jack</p> 
     <ul> 
      <li>Sour</li> 
      <li>Sweet</li> 
      <li>Salty</li> 
     </ul> 
     <span>Fruits are good</span> 
    </div> 
    <div class="mango"> 
     <p>Randy</p> 
     <p>James</p> 
    </div> 
</div> 
<div class="apple"> 
    <p>Bill</p> 
    <p>Sean</p> 
</div> 
</div>

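Now I want to grab the text inside the div with class "apple" that sits under the div with class "fruits".

This is what I have tried so far...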

for node in soup.find_all("div", class_="apple")

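Which returns...

Bill
Sean

But I want it to return only...

John
Sam
Bailey
Jack
Sour
Sweet
Salty
Fruits are good

Please note that I do not know the exact structure of the elements inside div class="apple"; it can contain any kind of HTML elements. So the selector has to be flexible enough.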

Below is the full code, where I need to add this BeautifulSoup code...

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
# NewsFields is the project's own scrapy Item class, defined elsewhere.

class MySpider(CrawlSpider):
    name = 'dknnews'
    start_urls = ['http://www.example.com/uat-area/scrapy/all-news-listing/_recache']
    allowed_domains = ['example.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        #soup = BeautifulSoup(content.decode('utf-8','ignore'))
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name":"dknpagetype"})
        ptitle = soup.find_all(attrs={"name":"dknpagetitle"})
        pturl = soup.find_all(attrs={"name":"dknpageurl"})
        ptdate = soup.find_all(attrs={"name":"dknpagedate"})
        ptdesc = soup.find_all(attrs={"name":"dknpagedescription"})
        for node in soup.find_all("div", class_="apple"):  # THIS IS WHERE I NEED TO ADD THE BS CODE
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)

I am not sure how to use nested selectors with BeautifulSoup find_all?

Any help is much appreciated.

Thanks

2017-02-04 Slyper

soup.select('.fruits .apple p')

With a CSS selector, it is easy to express nested classes.

soup.find(class_='fruits').find(class_="apple").find_all('p')

Alternatively, you can use find() to reach the p tags step by step.
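As a quick check of the two approaches above, here is a minimal sketch that runs both against a trimmed copy of the HTML from the question. The variable names and the use of the built-in html.parser (instead of lxml) are choices made for this illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML: one .apple inside .fruits, one outside.
html = """
<div class="container">
  <div class="fruits">
    <div class="apple"><p>John</p><p>Sam</p></div>
    <div class="mango"><p>Randy</p></div>
  </div>
  <div class="apple"><p>Bill</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: any .apple somewhere inside .fruits, then its <p> tags.
via_select = [p.get_text() for p in soup.select(".fruits .apple p")]

# Same idea with chained find() calls; find() returns only the first match.
via_find = [
    p.get_text()
    for p in soup.find(class_="fruits").find(class_="apple").find_all("p")
]

# A plain find_all() matches both .apple divs, including the one outside .fruits.
all_apples = soup.find_all("div", class_="apple")

print(via_select)       # ['John', 'Sam']
print(via_find)         # ['John', 'Sam']
print(len(all_apples))  # 2
```

Both forms skip the second .apple div because it is not a descendant of .fruits. If only direct children should match, the child combinator '.fruits > .apple' is a stricter alternative.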

EDIT:

[s for div in soup.select('.fruits .apple') for s in div.stripped_strings]

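The strings generator yields every string under the div tag; stripped_strings does the same but strips surrounding whitespace, so entries such as \n are excluded from the result.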
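To make the difference concrete, the sketch below compares the two generators on a small fragment. The fragment and variable names are made up for illustration, and html.parser is used only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup

html = """<div class="apple">
  <p>John</p>
  <span>  Fruits are good </span>
</div>"""

div = BeautifulSoup(html, "html.parser").find("div", class_="apple")

# .strings keeps every text node, including whitespace-only ones between tags.
raw = list(div.strings)

# .stripped_strings strips each string and drops the ones that end up empty.
clean = list(div.stripped_strings)

print(raw)
print(clean)  # ['John', 'Fruits are good']
```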

Out:

['John', 'Sam', 'Bailey', 'Jack', 'Sour', 'Sweet', 'Salty', 'Fruits are good']

Full code:

from bs4 import BeautifulSoup 
source_code = """<div class="container"> 
<div class="fruits"> 
    <div class="apple"> 
     <p>John</p> 
     <p>Sam</p> 
     <p>Bailey</p> 
     <p>Jack</p> 
     <ul> 
      <li>Sour</li> 
      <li>Sweet</li> 
      <li>Salty</li> 
     </ul> 
     <span>Fruits are good</span> 
    </div> 
    <div class="mango"> 
     <p>Randy</p> 
     <p>James</p> 
    </div> 
</div> 
<div class="apple"> 
    <p>Bill</p> 
    <p>Sean</p> 
</div> 
</div> 
""" 
soup = BeautifulSoup(source_code, 'lxml')
# Collect every stripped string from each .apple div nested inside .fruits
print([s for div in soup.select('.fruits .apple') for s in div.stripped_strings])
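To plug this into a spider callback like the one in the question, one option is to wrap the selector in a small helper that returns a single space-separated string. This is only a sketch: the function name fruits_apple_text and its parser argument are hypothetical, and html.parser is used so the example runs without lxml.

```python
from bs4 import BeautifulSoup


def fruits_apple_text(markup, parser="html.parser"):
    """Hypothetical helper: join all text from .apple elements nested in .fruits."""
    soup = BeautifulSoup(markup, parser)
    parts = [
        s
        for div in soup.select(".fruits .apple")
        for s in div.stripped_strings
    ]
    return " ".join(parts)


sample = """
<div class="fruits">
  <div class="apple"><p>John</p><ul><li>Sour</li></ul><span>Fruits are good</span></div>
</div>
<div class="apple"><p>Bill</p></div>
"""

print(fruits_apple_text(sample))  # John Sour Fruits are good
```

Inside parse, the helper could then replace the find_all loop, for example by passing it response.body and storing the result in nf['bodytext'].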

2017-02-04 08:21:55

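Thanks for your reply. I have updated the question.... – Slyper

Thanks, but there are 2 divs with class 'apple' in the code. Do you think your code will only target the div that belongs to class "fruits"? – Slyper

@Puneet Sharma You changed the expected output in your question, so I updated my answer. –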
