2017-08-04 69 views
0

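How to extract the first instance of a nested class with BeautifulSoup

There are multiple elements that share the class name "row", and inside each row there are multiple elements that share the class name "column".

I want to iterate over the rows and collect only the first column of each row.

I then print the contents of the link in that column.

What is the proper way to do this? I tried building a list, but after creating the list I could no longer call BeautifulSoup functions on the object.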

This is the URL of the page:

https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils

rows = soup.find_all('div', attrs={'class': 'row'})

for row in rows:
    # find() returns only the first matching column in this row
    col = row.find('div', attrs={'class': 'column'})
    link = col.find('a')
    print(link.contents)
+1

Please share some sample HTML. –

+0

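Because the document is not well formatted, I will share the link instead; I suggest viewing it with the Chrome developer tools. I am trying to extract the first instance of each product name (the first column of each row). – rickyjoepr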

+0

@rickyjoepr So you want to get the link for every product on the site? –

Answers

1

It looks like a cookie has to be set before the content on the subcategory page becomes visible. So, if I understand the question right:

import requests 
from bs4 import BeautifulSoup 
# You need to store cookies so use a session. 
s = requests.Session() 
# Request a page to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories") 
# Make the real request. 
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils") 
soup = BeautifulSoup(page.content,'html.parser') 
# Get the div. 
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'}) 
# Get the a element text. 
for div in divs: 
    print (div.find('a').text) 

Output:

Balsam Fir 15 ml 
Balsam Fir 30 ml 
Balsam Fir 5 ml 
Basil Essential Oil 15ml 
Basil Essential Oil 30ml 
Basil Essential Oil 3ml 
Basil Essential Oil 5ml 
Bergamot Essential Oil 15ml 
... 
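
As for the original worry about lists: find_all returns a list-like ResultSet of Tag objects, so the search methods are still available on each element, just not on the list itself. The following is a minimal offline sketch of "first column of each row". The HTML snippet is hypothetical markup made up for illustration (the real page uses different class names, as the code above shows), and the None checks are an added precaution rather than part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page's classes differ.
html = """
<div class="row">
  <div class="column"><a href="/p/1">Balsam Fir 15 ml</a></div>
  <div class="column"><a href="/p/1">Details</a></div>
</div>
<div class="row">
  <div class="column"><a href="/p/2">Basil Essential Oil 15ml</a></div>
  <div class="column"><a href="/p/2">Details</a></div>
</div>
<div class="row"></div>
"""

soup = BeautifulSoup(html, "html.parser")

first_links = []
# Each item in the ResultSet is a Tag, so find() can be called on it.
for row in soup.find_all("div", class_="row"):
    # find() returns the first matching descendant, or None if there is none.
    col = row.find("div", class_="column")
    link = col.find("a") if col is not None else None
    if link is not None:
        first_links.append(link.get_text(strip=True))

print(first_links)
```

The empty third row is skipped instead of raising an AttributeError, which is what calling find on None would do.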

If you only want the unique names, strip the size off with a regular expression and add each name to a set:

import requests 
from bs4 import BeautifulSoup 
import re 
# You need to store cookies so use a session. 
s = requests.Session() 
# Request a page to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories") 
# Make the real request. 
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils") 
soup = BeautifulSoup(page.content,'html.parser') 
# Get the div. 
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'}) 
# Get the a element text. 
a = set() 
for div in divs: 
    text = div.find('a').text 
    a.add(re.sub(r'\s*\d+\s*ml$', '', text))
print (a) 

Output:

{'Lavender, Bulgarian Essential Oil', 'Thyme, White', 'Mandarin, Red Essential Oil', 'Pine Needle Essential Oil', 'Lemongrass Essential Oil', 'Fir Needle, Siberian', 'Spruce', 'Peppermint', 'Lime Essential Oil', 'Myrrh', 'Juniper Essential Oil', 'Petitgrain', 'Wintergreen', 'Lemon Essential Oil', 'Palmarosa', 'Balsam Fir', 'Chamomile, Roman', 'Cypress', 'Citronella', 'Rosemary', 'Lemon myrtle Essential Oil', 'Clary Sage', 'Cinnamon Bark', 'Frankincense', 'Tangerine', 'Cocoa, Absolute', 'Spearmint', 'Ravensara Essential Oil', 'Spike Lavender Essential Oil', 'Hyssop', 'Ylang Ylang', 'Basil Essential Oil', 'Bergamot Essential Oil', 'Fir Needle, Siberian1', 'Geranium Bourbon', 'Patchouli', 'Black Pepper Essential Oil', 'Fennel', 'Grapefruit Essential Oil', 'Eucalyptus', 'Carrot Seed Essential Oil', 'Chamomile, German', 'Vetiver', 'Tea Tree', 'Ginger', 'Marjoram, Sweet', 'Clove Bud'} 
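
The size-stripping step can be tried on its own, without any network request. This sketch uses a handful of names taken from the first output above; the strip() call and the sorting are additions for illustration, on the assumption that the link text might carry surrounding whitespace, which would stop the `$` anchor from matching.

```python
import re

# Sample product names taken from the first output above.
names = [
    "Balsam Fir 15 ml",
    "Balsam Fir 30 ml",
    "Balsam Fir 5 ml",
    "Basil Essential Oil 15ml",
    "Basil Essential Oil 3ml",
    "Bergamot Essential Oil 15ml",
]

# Optional whitespace, digits, optional whitespace, then "ml" at the end.
size_suffix = re.compile(r"\s*\d+\s*ml$")

# A set comprehension removes duplicates; sorted() gives a stable order.
unique_names = sorted({size_suffix.sub("", name.strip()) for name in names})

print(unique_names)
# ['Balsam Fir', 'Basil Essential Oil', 'Bergamot Essential Oil']
```

Names that do not end in a size are left unchanged, because the pattern only matches at the end of the string.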
+0

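Yes Dan, this is the output I wanted. I thought Beautiful Soup would have a way to get only the first column within a row, but this is what I needed. – rickyjoepr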

+1

The reason I did it this way is that "Chamomile, Roman" is the first item in two rows. This way, if the page changes, you will not lose data. –

+0

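Doing it your way makes more sense. The regular expression solves the problem in a very direct way, and it will keep working even if the page changes. I think it is great. – rickyjoepr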
