2016-02-16

I want to batch-download images from a Google image search, scraping Google Images with Python 3 (requests + BeautifulSoup).

My first approach, downloading the page source to a file and then reading that file with open(), works fine, but I want to be able to fetch the image URLs just by running the script and changing the keyword.

First approach: go to the image search (https://www.google.no/search?q=tower&client=opera&hs=UNl&source=lnms&tbm=isch&sa=X&ved=0ahUKEwiM5fnf4_zKAhWIJJoKHYUdBg4Q_AUIBygB&biw=1920&bih=982). View the page source in the browser and save it as an HTML file. When I then open() that HTML file with the script, the script works as expected and I get a neat list of all the image URLs on the search page. This is line 6 of the script (uncomment it to test).

However, if I use the requests.get() function to fetch the page, as on line 7 of the script, it fetches a different HTML document that does not contain the full URLs of the images, so I cannot extract them.

Please help me extract the correct image URLs.

Edit: link to tower.html, the file I am using: https://www.dropbox.com/s/yy39w1oc8sjkp3u/tower.html?dl=0

This is the code I have written so far:

import requests
from bs4 import BeautifulSoup

# define the url to be scraped
url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# top line uses the attached "tower.html" as the source, bottom line uses the url; the html file contains the source of the above url
#page = open('tower.html', 'r').read()
page = requests.get(url).text

# parse the text as html
soup = BeautifulSoup(page, 'html.parser')

# iterate over all "a" elements
for raw_link in soup.find_all('a'):
    link = raw_link.get('href')
    # keep only string links that contain "imgurl" (the page has other, uninteresting links)
    if type(link) == str and 'imgurl' in link:
        # print the part of the link between "=" and "&", which is the actual url of the image
        print(link.split('=')[1].split('&')[0])
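As an aside, the naive split('=')[1] above breaks whenever the image URL itself contains an "=" or when imgurl is not the first query parameter. A more robust sketch using the standard library's urllib.parse (the example href is made up, in the style of the /imgres links found in the saved page source):

```python
from urllib.parse import urlparse, parse_qs

def extract_imgurl(href):
    """Return the decoded imgurl parameter of an /imgres-style link, or None."""
    query = parse_qs(urlparse(href).query)  # parse_qs percent-decodes the values
    values = query.get('imgurl')
    return values[0] if values else None

# hypothetical href, illustrating the link shape in the saved browser source
href = '/imgres?imgurl=http%3A%2F%2Fexample.com%2Ftower.jpg&imgrefurl=http%3A%2F%2Fexample.com%2F&h=400'
print(extract_imgurl(href))  # http://example.com/tower.jpg
```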

Answer

Just so you're aware:

# http://www.google.com/robots.txt 

User-agent: * 
Disallow: /search 
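For what it's worth, Python's standard library ships urllib.robotparser for exactly this kind of check. A minimal sketch, feeding it the two rules quoted above directly instead of fetching the live file:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse the quoted rules directly; rp.set_url(...) + rp.read() would fetch the live robots.txt
rp.parse(['User-agent: *', 'Disallow: /search'])

print(rp.can_fetch('*', 'http://www.google.com/search?q=tower'))  # False
print(rp.can_fetch('*', 'http://www.google.com/maps'))            # True
```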



The point of my answer is that Google relies heavily on scripts. You are most likely getting different results because the page you request with requests does nothing with the scripts served on it, whereas loading the page in a web browser does.

Here's what I get when I request the URL you supplied:

The text I get back from requests.get(url).text does not contain 'imgurl' anywhere. Your script is looking for that as part of its criteria, and it simply isn't there.

However, I do see a bunch of <img> tags with their src attribute set to an image URL. If that's what you're after, try this script:

import requests
from bs4 import BeautifulSoup

url = 'https://www.google.no/search?q=tower&client=opera&hs=cTQ&source=lnms&tbm=isch&sa=X&ved=0ahUKEwig3LOx4PzKAhWGFywKHZyZAAgQ_AUIBygB&biw=1920&bih=982'

# page = open('tower.html', 'r').read()
page = requests.get(url).text

soup = BeautifulSoup(page, 'html.parser')

for raw_img in soup.find_all('img'):
    link = raw_img.get('src')
    if link:
        print(link)

It returns results like the following:

https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyxRHrFw0NM-ZcygiHoVhY6B6dWwhwT4va727380n_IekkU9sC1XSddAg 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfuhcCcOnC8DmOfweuWMKj3cTKXHS74XFh9GYAPhpD0OhGiCB7Z-gidkVk 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSOBZ9iFTXR8sGYkjWwPG41EO5Wlcv2rix0S9Ue1HFcts4VcWMrHkD5y10 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTEAZM3UoqqDCgcn48n8RlhBotSqvDLcE1z11y9n0yFYw4MrUFucPTbQ0Ma 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcSJvthsICJuYCKfS1PaKGkhfjETL22gfaPxqUm0C2-LIH9HP58tNap7bwc 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQGNtqD1NOwCaEWXZgcY1pPxQsdB8Z2uLGmiIcLLou6F_1c55zylpMWvSo 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSdRxvQjm4KWaxhAnJx2GNwTybrtUYCcb_sPoQLyAde2KMBUhR-65cm55I 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQLVqQ7HLzD7C-mZYQyrwBIUjBRl8okRDcDoeQE-AZ2FR0zCPUfZwQ8Q20 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQHNByVCZzjSuMXMd-OV7RZI0Pj7fk93jVKSVs7YYgc_MsQqKu2v0EP1M0 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcS_RUkfpGZ1xJ2_7DCGPommRiIZOcXRi-63KIE70BHOb6uRk232TZJdGzc 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSxv4ckWM6eg_BtQlSkFP9hjRB6yPNn1pRyThz3D8MMaLVoPbryrqiMBvlZ 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQWv_dHMr5ZQzOj8Ort1gItvLgVKLvgm9qaSOi4Uomy13-gWZNcfk8UNO8 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRRwzRc9BJpBQyqLNwR6HZ_oPfU1xKDh63mdfZZKV2lo1JWcztBluOrkt_o 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQdGCT2h_O16OptH7OofZHNvtUhDdGxOHz2n8mRp78Xk-Oy3rndZ88r7ZA 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRnmn9diX3Q08e_wpwOwn0N7L1QpnBep1DbUFXq0PbnkYXfO0wBy6fkpZY 
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSaP9Ok5n6dL5K1yKXw0TtPd14taoQ0r3HDEwU5F9mOEGdvcIB0ajyqXGE 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTcyaCvbXLYRtFspKBe18Yy5WZ_1tzzeYD8Obb-r4x9Yi6YZw83SfdOF5fm 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnS1qCjeYrbUtDSUNcRhkdO3fc3LTtN8KaQm-rFnbj_JagQEPJRGM-DnY0 
https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSiX_elwJQXGlToaEhFD5j2dBkP70PYDmA5stig29DC5maNhbfG76aDOyGh 
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQb3ughdUcPUgWAF6SkPFnyiJhe9Eb-NLbEZl_r7Pvt4B3mZN1SVGv0J-s 
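If the thumbnails are enough, a small sketch for saving them to disk might look like this (download_thumbs and thumb_filename are made-up helper names, not part of any library, and the .jpg extension is an assumption about what gstatic serves):

```python
import os
import requests

def thumb_filename(index, directory='thumbs'):
    """Made-up helper: derive a local path for the index-th thumbnail."""
    return os.path.join(directory, 'tower_%03d.jpg' % index)

def download_thumbs(urls, directory='thumbs'):
    """Fetch each thumbnail URL with requests and write the bytes to disk."""
    os.makedirs(directory, exist_ok=True)
    for index, url in enumerate(urls):
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        with open(thumb_filename(index, directory), 'wb') as f:
            f.write(response.content)
```

Calling download_thumbs() with the list of src values printed above would fill a thumbs/ directory with tower_000.jpg, tower_001.jpg, and so on.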

I have tried scraping with urllib, which mostly gave me "Forbidden" back; I believe that is because of the Disallow you mention. urllib works on anything except Google Images. I know there is no 'imgurl' anywhere in the text requests parses. The results you get are thumbnails of the images. That is better than nothing, but I would like to harvest the full-resolution images, and the problem is that the parsed text never contains those. Is there any way to make requests follow the scripts and actually fetch the addresses of the source images? –


That's why it gives you "Forbidden" back. They've built an entire module for parsing a site's robots.txt file and determining whether crawling is allowed. You could try the 're' library and use regular expressions to find the values, but I think Google makes its search pages hard to parse... and they make them hard for a reason. – ngoue
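A hedged sketch of that regex idea, run against page source saved from a browser (where the imgurl= links do appear); the sample string below is made up for illustration:

```python
import re
from urllib.parse import unquote

def find_imgurls(html):
    """Pull every percent-encoded imgurl= value out of raw page source text."""
    return [unquote(m) for m in re.findall(r'imgurl=([^&"]+)', html)]

sample = 'href="/imgres?imgurl=http%3A%2F%2Fexample.com%2Fa.jpg&imgrefurl=x"'
print(find_imgurls(sample))  # ['http://example.com/a.jpg']
```

This only works on HTML that actually contains imgurl= links, such as the tower.html saved from the browser, not on what requests fetches.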


Anyway, thanks for the edit for extracting the thumbnails :) –