2015-04-02 66 views

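I'm trying to extract information from the BBC Good Food website, but I'm having trouble narrowing down the data I collect. How can I scrape this page with BeautifulSoup and Python?

Here is what I have so far: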

from bs4 import BeautifulSoup 
import requests 

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato') 
soup = BeautifulSoup(webpage.content, "html.parser")  # name the parser explicitly
links = soup.find_all("a") 

for anchor in links: 
    print(anchor.get('href'), anchor.text)

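This returns every link on the page in question, plus the text description of each link, but I only want the links from the "article" elements on the page. These are the links to specific recipes.

Through some experimentation I have managed to return the text content of the articles, but I can't seem to extract the links.

Answer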


The only two things I can see in the article tags that look relevant are the href and the img src:

from bs4 import BeautifulSoup 
import requests 

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato') 
soup = BeautifulSoup(webpage.content, "html.parser")  # name the parser explicitly
links = soup.find_all("article") 

for ele in links: 
    print(ele.a["href"]) 
    print(ele.img["src"]) 
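
If an article element has no a or img child, ele.a or ele.img is None and the indexing above raises a TypeError. Below is a minimal sketch with a guard, run against a small made-up HTML snippet rather than the live page. The markup in SAMPLE_HTML and the helper name article_links are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results page.
SAMPLE_HTML = """
<article><h2 class="node-title"><a href="/recipes/2075/tomato-soup">Tomato soup</a></h2>
<img src="/img/tomato-soup.jpg"></article>
<article><p>No link or image here</p></article>
"""

def article_links(html):
    """Return (href, img_src) pairs for each <article> that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for article in soup.find_all("article"):
        anchor = article.find("a", href=True)
        if anchor is None:
            continue  # skip articles without a link
        image = article.find("img", src=True)
        # img_src is None when the article has a link but no image
        results.append((anchor["href"], image["src"] if image else None))
    return results

print(article_links(SAMPLE_HTML))
```

Articles without a link are skipped instead of stopping the loop with an exception.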

The links are in the h2 tags with class="node-title":

from bs4 import BeautifulSoup 
import requests 

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato') 
soup = BeautifulSoup(webpage.content, "html.parser")  # name the parser explicitly


links = soup.find("div",{"class":"main row grid-padding"}).find_all("h2",{"class":"node-title"}) 

for l in links: 
    print(l.a["href"]) 

/recipes/681646/tomato-tart 
/recipes/4468/stuffed-tomatoes 
/recipes/1641/charred-tomatoes 
/recipes/tomato-confit 
/recipes/1575635/roast-tomatoes 
/recipes/2536638/tomato-passata 
/recipes/2518/cherry-tomatoes 
/recipes/681653/stuffed-tomatoes 
/recipes/2852676/tomato-sauce 
/recipes/2075/tomato-soup 
/recipes/339605/tomato-sauce 
/recipes/2130/essence-of-tomatoes- 
/recipes/2942/tomato-tarts 
/recipes/741638/fried-green-tomatoes-with-ripe-tomato-salsa 
/recipes/3509/honey-and-thyme-tomatoes 
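
The same lookup can probably be written more compactly as a CSS selector with select. This is a sketch on a made-up snippet: SAMPLE_HTML is an assumption that only mimics the class names used above, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names from the answer.
SAMPLE_HTML = """
<div class="main row grid-padding">
  <h2 class="node-title"><a href="/recipes/681646/tomato-tart">Tomato tart</a></h2>
  <h2 class="node-title"><a href="/recipes/2075/tomato-soup">Tomato soup</a></h2>
  <h2 class="other"><a href="/about">About</a></h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Anchors with an href, inside h2.node-title, inside the main results div.
hrefs = [a["href"] for a in soup.select("div.main h2.node-title a[href]")]

for href in hrefs:
    print(href)
```

The a[href] part of the selector filters out anchors that have no href attribute, so the indexing cannot fail.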

To request these pages you need to prepend http://www.bbcgoodfood.com to each href:

for l in links:
    print(requests.get("http://www.bbcgoodfood.com{}".format(l.a["href"])).status_code)
200 
200 
200 
200 
200 
200 
200 
200 
200 
200 
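
String formatting works here because every href starts with a slash. As an alternative, urljoin from the standard library also copes with hrefs that are already absolute. A small sketch follows; the BASE_URL constant and the absolute_urls helper are names chosen for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.bbcgoodfood.com"

def absolute_urls(hrefs, base=BASE_URL):
    """Resolve each href against the site root; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]

urls = absolute_urls(["/recipes/2075/tomato-soup", "/recipes/tomato-confit"])
for url in urls:
    print(url)
```

The resulting list can be passed to requests.get one URL at a time, as in the loop above.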

Thank you very much! – jm22b 2015-04-03 09:53:46


No problem, you're welcome. – 2015-04-03 10:45:18