2015-11-13 90 views

Extracting an image URL with Python and BeautifulSoup

I use the following to extract the HTML I need from an Amazon listing:

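import requests
from bs4 import BeautifulSoup

# Fetch the product page.
r = requests.get("http://rads.stackoverflow.com/amzn/click/B0007RXSB4")

# Parse the response body.
soup = BeautifulSoup(r.content)

# Find the wrapper div that holds the main product image.
soup.find_all("div", {"id": "imgTagWrapperId"})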

This gives me the following:

[<div class="imgTagWrapper" id="imgTagWrapperId">\n<img
  alt="Johnston &amp; Murphy Men's Greenwich Oxford,Black,6 D"
  class="a-dynamic-image a-stretch-vertical"
  data-a-dynamic-image='{"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY695_.jpg":[695,695],
    "http://ecx.images-amazon.com/images/I/81zwayZox-S._UY535_.jpg":[535,535],
    "http://ecx.images-amazon.com/images/I/81zwayZox-S._UY500_.jpg":[500,500],
    "http://ecx.images-amazon.com/images/I/81zwayZox-S._UY575_.jpg":[575,575],
    "http://ecx.images-amazon.com/images/I/81zwayZox-S._UY395_.jpg":[395,395],
    "http://ecx.images-amazon.com/images/I/81zwayZox-S._UY585_.jpg":[585,585]}'
  data-old-hires="http://ecx.images-amazon.com/images/I/81zwayZox-S._UL1500_.jpg"
  id="landingImage"
  onload="this.onload='';setCSMReq('af');if(typeof addlongPoleTag === 'function'){ addlongPoleTag('af','desktop-image-atf-marker');};setCSMReq('cf')"
  src="http://ecx.images-amazon.com/images/I/41KixMIlPNL._SY395_QL70_.jpg"
  style="max-width:695px;max-height:695px;">\n</img></div>]

I just need to know how to extract http://ecx.images-amazon.com/images/I/81zwayZox-S._UY695_.jpg from the output above.

Answer

First, you need to find the img tag inside the div you have already located. One way is to chain find() calls:

img = soup.find("div", {"id": "imgTagWrapperId"}).find("img") 

Or use a CSS selector:

img = soup.select_one("div#imgTagWrapperId > img") 

Then, if you need the image URL from the src attribute:

img["src"] 
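
Both find() and select_one() return None when nothing matches, so subscripting the result directly can fail on pages that lack the wrapper div. The following is a minimal sketch of a more defensive lookup. The SAMPLE_HTML snippet and the extract_src helper are hypothetical stand-ins for illustration, not part of the original page or of BeautifulSoup, and the "html.parser" choice is an assumption made to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down markup standing in for the real product page.
SAMPLE_HTML = """
<div class="imgTagWrapper" id="imgTagWrapperId">
  <img id="landingImage"
       src="http://ecx.images-amazon.com/images/I/41KixMIlPNL._SY395_QL70_.jpg">
</div>
"""


def extract_src(html):
    """Return the src of the img directly inside div#imgTagWrapperId, or None."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one("div#imgTagWrapperId > img")
    if img is None:
        # The wrapper div or its image is missing from this page.
        return None
    # .get() returns None instead of raising if the attribute is absent.
    return img.get("src")


src = extract_src(SAMPLE_HTML)
missing = extract_src("<div id='other'></div>")
print(src)
print(missing)
```

Using img.get("src") rather than img["src"] trades a KeyError for a None result, which is usually easier to handle when scraping pages whose markup may change.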

If you need the image URLs inside the data-a-dynamic-image attribute, I suggest loading the value into a Python dictionary with the json module and then getting its keys():

import json 

img = soup.find("div", {"id": "imgTagWrapperId"}).find("img") 
data = json.loads(img["data-a-dynamic-image"]) 
print(list(data.keys())) 

Prints:

[ 
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY695_.jpg', 
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY575_.jpg',  
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY500_.jpg',  
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY395_.jpg',  
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY535_.jpg',  
    u'http://ecx.images-amazon.com/images/I/81zwayZox-S._UY585_.jpg' 
]
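
Dictionary key order is not guaranteed to match the order in the attribute on older Python versions, so taking the first key may not yield the 695px image the question asks for. As a rough sketch, one could pick the URL with the largest dimensions instead. This assumes each value is a [width, height] pair, which the page output suggests but does not confirm. RAW_ATTR is copied from the question's output, and largest_image_url is a hypothetical helper name.

```python
import json

# Value of data-a-dynamic-image, copied from the question's output.
RAW_ATTR = (
    '{"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY695_.jpg":[695,695],'
    '"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY535_.jpg":[535,535],'
    '"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY500_.jpg":[500,500],'
    '"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY575_.jpg":[575,575],'
    '"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY395_.jpg":[395,395],'
    '"http://ecx.images-amazon.com/images/I/81zwayZox-S._UY585_.jpg":[585,585]}'
)


def largest_image_url(raw_attr):
    """Return the URL whose (assumed) width * height is largest, or None."""
    sizes = json.loads(raw_attr)
    if not sizes:
        return None
    # Assumption: each value is a [width, height] pair.
    return max(sizes, key=lambda url: sizes[url][0] * sizes[url][1])


best = largest_image_url(RAW_ATTR)
print(best)
```

With a parsed page, the same helper could be called as largest_image_url(img["data-a-dynamic-image"]).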