It looks like you need a cookie set before you can see the content on the subcategory page. So, if I understand the question correctly:
import requests
from bs4 import BeautifulSoup
# You need to store cookies so use a session.
s = requests.Session()
# Request the category page first to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content,'html.parser')
# Get the div.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Get the a element text.
for div in divs:
    print(div.find('a').text)
Output:
Balsam Fir 15 ml
Balsam Fir 30 ml
Balsam Fir 5 ml
Basil Essential Oil 15ml
Basil Essential Oil 30ml
Basil Essential Oil 3ml
Basil Essential Oil 5ml
Bergamot Essential Oil 15ml
...
If you just want the unique names, strip the size off with a regular expression and add each name to a set:
import requests
from bs4 import BeautifulSoup
import re
# You need to store cookies so use a session.
s = requests.Session()
# Request the category page first to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content,'html.parser')
# Get the div.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Get the a element text.
a = set()
for div in divs:
    text = div.find('a').text
    # Strip a trailing size such as " 15 ml" or "30ml".
    a.add(re.sub(r'\s*\d+\s*ml$', '', text))
print(a)
Output:
{'Lavender, Bulgarian Essential Oil', 'Thyme, White', 'Mandarin, Red Essential Oil', 'Pine Needle Essential Oil', 'Lemongrass Essential Oil', 'Fir Needle, Siberian', 'Spruce', 'Peppermint', 'Lime Essential Oil', 'Myrrh', 'Juniper Essential Oil', 'Petitgrain', 'Wintergreen', 'Lemon Essential Oil', 'Palmarosa', 'Balsam Fir', 'Chamomile, Roman', 'Cypress', 'Citronella', 'Rosemary', 'Lemon myrtle Essential Oil', 'Clary Sage', 'Cinnamon Bark', 'Frankincense', 'Tangerine', 'Cocoa, Absolute', 'Spearmint', 'Ravensara Essential Oil', 'Spike Lavender Essential Oil', 'Hyssop', 'Ylang Ylang', 'Basil Essential Oil', 'Bergamot Essential Oil', 'Fir Needle, Siberian1', 'Geranium Bourbon', 'Patchouli', 'Black Pepper Essential Oil', 'Fennel', 'Grapefruit Essential Oil', 'Eucalyptus', 'Carrot Seed Essential Oil', 'Chamomile, German', 'Vetiver', 'Tea Tree', 'Ginger', 'Marjoram, Sweet', 'Clove Bud'}
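Note that a set does not keep the order in which the names appear on the page. If you want the first occurrence of each name in page order, here is a minimal offline sketch of the same regex idea. It assumes the link texts have already been collected into a list; the `titles` list and the `strip_size` helper are hypothetical names used only for illustration, and the sample values are copied from the first output above.

```python
import re

# Hypothetical sample of link texts, copied from the first output above.
titles = [
    "Balsam Fir 15 ml",
    "Balsam Fir 30 ml",
    "Balsam Fir 5 ml",
    "Basil Essential Oil 15ml",
    "Basil Essential Oil 30ml",
    "Bergamot Essential Oil 15ml",
]

# Same pattern as in the answer: optional whitespace, digits,
# optional whitespace, then "ml" at the end of the string.
SIZE_SUFFIX = re.compile(r"\s*\d+\s*ml$")


def strip_size(title):
    """Return the title without a trailing size such as ' 15 ml'."""
    return SIZE_SUFFIX.sub("", title.strip())


# dict.fromkeys drops duplicates while preserving insertion order
# (guaranteed for dicts since Python 3.7).
unique_names = list(dict.fromkeys(strip_size(t) for t in titles))
print(unique_names)
```

With the real page you would build `titles` from `div.find('a').text` for each div, exactly as in the loop above.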
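Please share a sample of the HTML. –
Because the document is not well-formed, I'll share the link instead, and I'd suggest using Chrome developer tools to inspect it. I'm trying to extract the first instance of each product name (the first column of each row). – rickyjoepr
@rickyjoepr So you want to get the link for every product on the site? –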