2017-07-01 73 views
-2

How to find images whose src starts with https:// using BeautifulSoup

I am trying to get every image src that contains https:// using BeautifulSoup.

import requests
from bs4 import BeautifulSoup

image_list = []
url = 'https://www.example.com'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")

for link in soup.find_all('img'):
    image_list.append(link.get('src'))

for link in image_list:
    if 'https' not in link:
        image_list.remove(link)
+0

What is your question? – DeepSpace

+1

Don't remove values from a list while you are iterating over it. –

+0

I'm trying to get all the image src links from a web page that start with "https://". – lolz

Answers

1

You can check whether each src starts with https and keep only those that do, like this:

from bs4 import BeautifulSoup 
image_list=[] 
div_test=""" 
<html> 
    <div id="d1"> 
     Text 1 
    </div> 
    <img src="http://test1.com/1.jpg"></img> 
    <div id="d2"> 
     Text 2 
     <a href="http://my.url/">a url</a> 
     Text 2 continue 
    </div> 
    <img src="https://test2.com/2.jpg"></img> 

    <div id="d3"> 
     Text 3 
    </div> 
    <img src="https://test3.com/3.jpg"></img> 
</html> 
""" 
soup = BeautifulSoup(div_test, 'html.parser') 
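for link in soup.find_all('img'):
    src = link.get('src')
    # get() returns None when an <img> has no src, so guard before calling startswith
    if src and src.startswith("https://"):  # keep only src values that start with https://
        image_list.append(src)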
print(image_list) 

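image_list will then contain only the https URLs (the output below is from Python 2; Python 3 prints the same list without the u prefixes):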

[u'https://test2.com/2.jpg', u'https://test3.com/3.jpg']
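
The same filter can also be written as a list comprehension, which builds a new list rather than removing items from one that is being looped over (the problem pointed out in the comments). This is only a minimal sketch: the function name https_image_srcs and the sample HTML are made up for illustration, and it assumes that only src values beginning with the literal prefix https:// should be kept.

```python
from bs4 import BeautifulSoup


def https_image_srcs(html):
    """Return the src of every <img> whose src starts with https://."""
    soup = BeautifulSoup(html, "html.parser")
    # Build a new list instead of removing items from one while looping over it.
    return [
        img["src"]
        for img in soup.find_all("img", src=True)  # src=True skips <img> tags with no src
        if img["src"].startswith("https://")
    ]


# Hypothetical sample markup, similar to the test HTML in the answer above.
sample = """
<img src="http://test1.com/1.jpg">
<img src="https://test2.com/2.jpg">
<img alt="no source">
<img src="https://test3.com/3.jpg">
"""

print(https_image_srcs(sample))
```

For a live page, the text passed in would be the body of a requests response, as in the question. Protocol-relative sources such as //cdn.example.com/a.jpg are not matched by this check and would need separate handling if they matter.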