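Your link is wrong. You can use the code below and see whether it fits your needs.
You only need to pass a searchTerm, and the program will open the Google page and fetch the URLs of 20 images.
Code: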
def get_images_links(searchTerm):
    import requests
    from bs4 import BeautifulSoup

    # build the Google Images search URL for the given term
    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)
    d = requests.get(searchUrl).text
    soup = BeautifulSoup(d, 'html.parser')

    img_tags = soup.find_all('img')

    # keep only absolute URLs; skip <img> tags with no src or an inline data: URI
    imgs_urls = []
    for img in img_tags:
        if img.get('src', '').startswith("http"):
            imgs_urls.append(img['src'])

    return imgs_urls
Usage:
get_images_links('computer')
Output:
['https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSeq5kKIsOg6zSM2bSrWEnYhpZEpmOYiiLzqf6qfwKzSVUoZ5rHoya75DM',
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTBUesIhyt4CgASIUDruqvvMzUBFCuG_iV92NXjZPMtPE5v2G626bge0g0',
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRYz8c6LUAiyuAsXkMrOH8DC56aFEMy63m8Fw8-ZdutB5EDpw1hl0y3xg',
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT33QNycX0Ghqhfqs7Masrk9uvp6d66VlD2djHFfqL4P6phZCJLxkSx0wnt',
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRUF11cLRzH2WNfiUJ3WeAOm7Veme0_GLfwoOCs3R5GTQDfcFHMgsNQlyo',
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTxcTcv4NPTboVorbD4I-uJbYjY4KjAR5JaMvUXCg33CLDUqop8IufKNw',
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTU8MkWwhDgcobqn_H2N3SS7dPVwu3I-ki1Sa_4u5YOEt-rAfOk1Kb2jpHO',
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQlGu_Y_dhu60UNyilmIUSuOjX5_UnmcWr2AXGJ0w6BmvCXUZissCrtPcw',
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQN7ItGvBHD1H9EMBC0ZFDMzNu5nt2L-EK1CKmQE4gRNtylalyTTJQxalY',
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQyFgwD4Wr20OImzk9Uc0gGGI2-7mYQAU6mJn2GEFkpgLTAqUQUm4KL0TUQwQ',
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQR0LFRaUGIadOO5_qolg9ZWegXW0OTghzBf1YzoIhpqkaiY1f3YNx4JnE',
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRuOk4nPPPaUdjnZl1pEwGwlfq25GjvZFsshmouB0QaV925KxHg43wJFWc6',
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR5aqLfB9SaFBALzp4Z2qToLeWqeUjqaS3EwNhi6faHRCxYCPMsjhmivKf8',
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR6deLi7H9DCaxJXJyP7lmoixad5Rgo1gBLfVQ35lEWrvpgPoyQJ8CcZ-4',
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSPQAfl2WB-AwziLan6NAzvzh2xVDu_XJEcjqSGOdnOJdffo7goWhrFd3wU',
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSB3o5cP8DMk9GqT9wpB1N7q6JtREUwitghlXO65UD5s3xCoLj80QuDlzw',
'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ18lWMvzZcIZvKI36BUUpnBIaa5e4A3TUAVdxAs6hhJ-rod446dMrPph2V',
'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR8XZhvomXcafQehhetM1_ZXOufBvWmEDAbOsqX-fiU5Xu3U6uWAO3XW-M',
'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQiWudrcl9y0XbtC19abcPfSwO4N060ipv4znqxnpLYWX5UFO-QdzJatd0r',
'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQtgqDxef3AOsiyUk0J0MbXgZT8c0JsAW3UpoumSTMFSGXde3BETrGSqw']
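Note that the function above inserts searchTerm into the URL as-is, so a term containing spaces or characters such as & may produce a malformed query. A minimal sketch of one way to handle that with the standard library's urllib.parse.urlencode follows; the helper name build_search_url is hypothetical and not part of the original code.

```python
from urllib.parse import urlencode


def build_search_url(search_term):
    # Hypothetical helper: same query parameters as the URL used above,
    # but with the values percent-encoded by urlencode.
    params = {"q": search_term, "site": "webhp", "tbm": "isch"}
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("red apple"))
print(build_search_url("cats & dogs"))
```

The returned string could be passed to requests.get in place of the searchUrl built with str.format.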
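Edit:
If you want more than 20 URLs, you have to find a way to send an Ajax request to get the rest of the page, or you can use Selenium to simulate user interaction with the page.
I used the second approach (there are probably tons of other ways to do this, and you can optimize this code a lot if you want):
Code 2: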
def scrape_all_imgs_google(searchTerm):
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup
    from time import sleep

    def scroll_page():
        # scroll to the bottom several times so that more thumbnails get loaded
        for i in range(7):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(3)

    def click_button():
        # Selenium 4 removed find_element_by_xpath; use find_element(By.XPATH, ...)
        more_imgs_button_xpath = '//*[@id="smb"]'
        driver.find_element(By.XPATH, more_imgs_button_xpath).click()

    def create_soup():
        html_source = driver.page_source
        return BeautifulSoup(html_source, 'html.parser')

    def find_imgs(soup):
        imgs_urls = []
        for img in soup.find_all('img'):
            # skip <img> tags with no src or an inline data: URI
            if img.get('src', '').startswith('http'):
                imgs_urls.append(img['src'])
        return imgs_urls

    #create webdriver
    driver = webdriver.Chrome()

    #define url using search term
    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)

    #get url
    driver.get(searchUrl)

    try:
        click_button()
        scroll_page()
    except Exception:
        # the button may not be present until the page has been scrolled
        scroll_page()
        click_button()

    #create soup only after we loaded all imgs when we scroll'ed the page down
    soup = create_soup()

    #find imgs in soup
    imgs_urls = find_imgs(soup)

    #close driver
    driver.quit()

    #return list of all img urls found in page
    return imgs_urls
Usage:
urls = scrape_all_imgs_google('computer')
print(len(urls))
print(urls)
Output:
377
['https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcT5Hi9cdE5JPyGl6G3oYfre7uHEie6zM-8q3zQOek0VLqQucGZCwwKGgfoE', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR0tu_xIYB__PVvdH0HKvPd5n1K-0GVbm5PDr1Br9XTyJxC4ORU5e8BVIiF', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQqHh6ZR6k-7izTfCLFK09Md19xJZAaHbBafCej6S30pkmTOfTFkhhs-Ksn', and etc...
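One possible optimization of the fixed seven scrolls in scroll_page is to keep scrolling until the page height stops growing. The sketch below is an assumption, not part of the original code: scroll_until_stable and its parameters are hypothetical names, and it relies only on the driver's execute_script method, which is already used above.

```python
from time import sleep


def scroll_until_stable(driver, pause=3.0, max_rounds=20):
    """Scroll to the bottom until the document height stops changing.

    Returns the number of scrolls performed (at most max_rounds).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give newly requested thumbnails time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # nothing new was loaded by this scroll, so stop here
            return round_no
        last_height = new_height
    return max_rounds
```

With a real driver this could stand in for the scroll_page() calls, for example scroll_until_stable(driver). The button click would probably still be needed, since the page may stop growing until the button for more results is pressed.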
If you don't want to use this code, you can take a look at Google Scraper and see whether it has any method that might be useful to you.
The link does not look right to me: https://www.google.co.kr/searchhl=ko&site=imghp&tbm=isch&source=hp&biw=1600&bih=770&q=' What exactly is your query? Please specify. – warl0ck