2017-04-06 94 views

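I'm trying to scrape image URLs with Python 3.x and BeautifulSoup

When I inspect the Google image search results page I am crawling with the developer tools, there are about 100 image URLs.

More URLs appear when I scroll down, but that part is fine.

The problem is that I only get 20 URLs.

I saved the response to my request as an HTML file and opened it.

I confirmed that it contains only 20 URLs.

I think only 20 image URLs are printed because the response itself contains only 20 image URLs.

How can I get all the image URLs?

Here is the source code.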

#-*- coding: utf-8 -*- 
import urllib.request 
from bs4 import BeautifulSoup 

if __name__ == "__main__": 
    print("Crawling!!!!!!!!!!!!!!!") 

    hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0)', 
      'referer' : 'http:google.com', 
      'Accept': 'text/html', 
      'Accept':'application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
      'Accept': 'none', 
      'Connection': 'keep-alive'} 

    inputSearch = "sites:pinterest+white+jeans" 
    req = urllib.request.Request("https://www.google.co.kr/searchhl=ko&site=imghp&tbm=isch&source=hp&biw=1600&bih=770&q=" + inputSearch, headers = hdr) 
    data = urllib.request.urlopen(req).read() 

    bs = BeautifulSoup(data, "html.parser") 

    for img in bs.find_all('img'):
        print(img.get('src'))

The link looks incorrect to me: https://www.google.co.kr/searchhl=ko&site=imghp&tbm=isch&source=hp&biw=1600&bih=770&q= Also, what exactly is your query? Please specify. – warl0ck
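
The malformed link mentioned in this comment is missing the "?" between the path and the query string. One way to avoid that kind of mistake is to let `urllib.parse.urlencode` assemble the query. This is only a minimal sketch: the function name `build_image_search_url` is made up for illustration, and the parameters are simply the ones from the question's URL.

```python
from urllib.parse import urlencode


def build_image_search_url(query, base="https://www.google.co.kr/search"):
    """Build a search URL from a dict so that "?" and "&" are always placed correctly.

    The parameter names and values are copied from the URL in the question;
    whether Google still honours all of them is an assumption.
    """
    params = {
        "hl": "ko",
        "site": "imghp",
        "tbm": "isch",
        "source": "hp",
        "biw": 1600,
        "bih": 770,
        "q": query,  # urlencode escapes spaces as "+" and ":" as "%3A"
    }
    return base + "?" + urlencode(params)


url = build_image_search_url("sites:pinterest white jeans")
print(url)
```

Passing the search words separated by spaces lets `urlencode` do the escaping, so there is no need to join them with "+" by hand.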

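Answer

Your link is wrong. You can use the code below and see whether it fits your needs.

You only need to pass searchTerm; the program opens the Google page and fetches the URLs of 20 images.

Code: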

def get_images_links(searchTerm):

    import requests
    from bs4 import BeautifulSoup

    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)
    d = requests.get(searchUrl).text
    soup = BeautifulSoup(d, 'html.parser')

    img_tags = soup.find_all('img')

    # keep only absolute http(s) URLs; skip <img> tags that have no src attribute
    imgs_urls = []
    for img in img_tags:
        src = img.get('src', '')
        if src.startswith("http"):
            imgs_urls.append(src)

    return imgs_urls

Usage:

get_images_links('computer') 

Output:

['https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSeq5kKIsOg6zSM2bSrWEnYhpZEpmOYiiLzqf6qfwKzSVUoZ5rHoya75DM', 
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTBUesIhyt4CgASIUDruqvvMzUBFCuG_iV92NXjZPMtPE5v2G626bge0g0', 
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRYz8c6LUAiyuAsXkMrOH8DC56aFEMy63m8Fw8-ZdutB5EDpw1hl0y3xg', 
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT33QNycX0Ghqhfqs7Masrk9uvp6d66VlD2djHFfqL4P6phZCJLxkSx0wnt', 
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRUF11cLRzH2WNfiUJ3WeAOm7Veme0_GLfwoOCs3R5GTQDfcFHMgsNQlyo', 
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTxcTcv4NPTboVorbD4I-uJbYjY4KjAR5JaMvUXCg33CLDUqop8IufKNw', 
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTU8MkWwhDgcobqn_H2N3SS7dPVwu3I-ki1Sa_4u5YOEt-rAfOk1Kb2jpHO', 
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQlGu_Y_dhu60UNyilmIUSuOjX5_UnmcWr2AXGJ0w6BmvCXUZissCrtPcw', 
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQN7ItGvBHD1H9EMBC0ZFDMzNu5nt2L-EK1CKmQE4gRNtylalyTTJQxalY', 
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQyFgwD4Wr20OImzk9Uc0gGGI2-7mYQAU6mJn2GEFkpgLTAqUQUm4KL0TUQwQ', 
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQR0LFRaUGIadOO5_qolg9ZWegXW0OTghzBf1YzoIhpqkaiY1f3YNx4JnE', 
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRuOk4nPPPaUdjnZl1pEwGwlfq25GjvZFsshmouB0QaV925KxHg43wJFWc6', 
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR5aqLfB9SaFBALzp4Z2qToLeWqeUjqaS3EwNhi6faHRCxYCPMsjhmivKf8', 
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR6deLi7H9DCaxJXJyP7lmoixad5Rgo1gBLfVQ35lEWrvpgPoyQJ8CcZ-4', 
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSPQAfl2WB-AwziLan6NAzvzh2xVDu_XJEcjqSGOdnOJdffo7goWhrFd3wU', 
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSB3o5cP8DMk9GqT9wpB1N7q6JtREUwitghlXO65UD5s3xCoLj80QuDlzw', 
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ18lWMvzZcIZvKI36BUUpnBIaa5e4A3TUAVdxAs6hhJ-rod446dMrPph2V', 
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8XZhvomXcafQehhetM1_ZXOufBvWmEDAbOsqX-fiU5Xu3U6uWAO3XW-M', 
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQiWudrcl9y0XbtC19abcPfSwO4N060ipv4znqxnpLYWX5UFO-QdzJatd0r', 
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQtgqDxef3AOsiyUk0J0MbXgZT8c0JsAW3UpoumSTMFSGXde3BETrGSqw'] 
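
To see what the `startswith("http")` filter in the function above keeps and drops, here is a small offline sketch. It deliberately uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without any install. The class name `ImgSrcCollector` and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag whose src starts with "http"."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip tags with no src and inline data: URIs
        if src and src.startswith("http"):
            self.urls.append(src)


# Hypothetical page fragment: two real URLs, one data URI, one tag without src
sample_html = """
<img src="https://example.com/a.jpg">
<img src="data:image/gif;base64,R0lGOD">
<img alt="no src here">
<img src="http://example.com/b.png">
"""

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.urls)
# ['https://example.com/a.jpg', 'http://example.com/b.png']
```

Only the two absolute URLs survive; the data URI and the tag without `src` are dropped.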

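Edit:

If you want more than 20 URLs, you have to find a way to send an Ajax request to get the rest of the page, or you can use Selenium to simulate your interaction with the web page.

I used the second approach (there are probably tons of other ways to do this, and you can optimize this code a lot if you want):

Code 2: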

def scrape_all_imgs_google(searchTerm):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup
    from time import sleep

    def scroll_page():
        #scroll to the bottom several times so that more results get loaded
        for i in range(7):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(3)

    def click_button():
        #"show more results" button; this id may differ on the current Google page
        more_imgs_button_xpath = '//*[@id="smb"]'
        driver.find_element(By.XPATH, more_imgs_button_xpath).click()

    def create_soup():
        html_source = driver.page_source
        return BeautifulSoup(html_source, 'html.parser')

    def find_imgs(soup):
        imgs_urls = []
        for img in soup.find_all('img'):
            #skip <img> tags that have no src attribute
            src = img.get('src', '')
            if src.startswith('http'):
                imgs_urls.append(src)
        return imgs_urls

    #create webdriver
    driver = webdriver.Chrome()

    #define url using search term
    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)

    #get url
    driver.get(searchUrl)

    try:
        click_button()
        scroll_page()
    except Exception:
        scroll_page()
        click_button()

    #create soup only after we loaded all imgs when we scroll'ed the page down
    soup = create_soup()

    #find imgs in soup
    imgs_urls = find_imgs(soup)

    #close driver
    driver.close()

    #return list of all img urls found in page
    return imgs_urls

Usage:

urls = scrape_all_imgs_google('computer') 

print(len(urls)) 
print(urls) 

Output:

377 
['https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcT5Hi9cdE5JPyGl6G3oYfre7uHEie6zM-8q3zQOek0VLqQucGZCwwKGgfoE', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR0tu_xIYB__PVvdH0HKvPd5n1K-0GVbm5PDr1Br9XTyJxC4ORU5e8BVIiF', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQqHh6ZR6k-7izTfCLFK09Md19xJZAaHbBafCej6S30pkmTOfTFkhhs-Ksn', and etc... 
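
One possible optimization of `scroll_page` in Code 2: instead of always scrolling seven times, keep scrolling until the page height stops growing. This is a rough sketch, not a drop-in replacement. The helper name `scroll_until_stable` and its `pause` and `max_rounds` parameters are made up here, and it only assumes the driver has the `execute_script` method that Code 2 already uses.

```python
import time


def scroll_until_stable(driver, pause=3.0, max_rounds=20):
    """Scroll to the bottom until document.body.scrollHeight stops changing.

    Returns the number of scroll rounds performed. max_rounds is a safety cap
    so that an endlessly growing page cannot loop forever.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom was reached
        last_height = new_height
    return rounds
```

Inside `scrape_all_imgs_google` it could be called as `scroll_until_stable(driver)` in place of `scroll_page()`. The pause still has to be long enough for the next batch of images to load, otherwise the loop stops too early.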

If you don't want to use this code, take a look at Google Scraper and see whether it has any methods that could be useful to you.

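Thanks. But I want to get more than 20 URLs. What should I do? –

@안진환 I've updated my answer. Take a look at the new function: 'scrape_all_imgs_google(searchTerm)'. –

@안진환 I'm glad I could help! Welcome to StackOverflow. If my answer solved your problem, you can [mark it as accepted](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) to close your question. –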