I was looking to build a parser for a site with a structure similar to Google's (i.e. a string of consecutive result pages, each with a table of content of interest).
The combination of the Selenium package (for page-element-based site navigation) and BeautifulSoup (for html parsing) turned out to be the weapon of choice for harvesting the written content. You may find them useful too, although I have no idea what defenses Google has in place to deter scraping.
A possible implementation for Mozilla Firefox using selenium, BeautifulSoup and geckodriver:
from bs4 import BeautifulSoup, SoupStrainer
from bs4.diagnose import diagnose
from os.path import isfile
from time import sleep
import codecs
from selenium import webdriver

def first_page(link):
    """Takes a link, and scrapes the desired tags from the html code"""
    driver = webdriver.Firefox(executable_path='C://example/geckodriver.exe')  # Specify the appropriate driver for your browser here
    counter = 1
    driver.get(link)
    html = driver.page_source
    filter_html_table(html)
    counter += 1
    return driver, counter

def nth_page(driver, counter, max_iter):
    """Takes a driver instance, a counter to keep track of iterations, and max_iter for the maximum number of iterations. Looks for a page element matching the current iteration (how you need to program this depends on the html structure of the page you want to scrape), navigates there, and calls scrape_page to scrape."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))  # For other strategies to retrieve elements from a page, see the selenium documentation
        pageLink.click()
        scrape_page(driver)
        counter += 1
    else:
        print("Done scraping")
    return

def scrape_page(driver):
    """Takes a driver instance, extracts html from the current page, and calls function to extract tags from html of total page"""
    html = driver.page_source  # Get html from page
    filter_html_table(html)  # Call function to extract desired html tags
    return

def filter_html_table(html):
    """Takes a full page of html, filters the desired tags using beautifulsoup, calls function to write to file"""
    only_td_tags = SoupStrainer("td")  # Specify which tags to keep
    filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify()  # Specify how to represent content
    write_to_file(filtered)  # Function call to store extracted tags in a local file.
    return

def write_to_file(output):
    """Takes the scraped tags, opens a new file if the file does not exist, or appends to an existing file, and writes extracted tags to file."""
    fpath = "<path to your output file>"
    if isfile(fpath):
        f = codecs.open(fpath, 'a')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    else:
        f = codecs.open(fpath, 'w')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    return
After this, it is simply a matter of calling:
link = "<link to site to scrape>"
driver, n_iter = first_page(link)
nth_page(driver, n_iter, 1000)  # the 1000 lets us scrape 1000 of the result pages
Note that this script assumes that the result pages you are trying to scrape are numbered sequentially, and that those numbers can be retrieved from the scraped page's html using 'find_element_by_link_text'. For other strategies to retrieve elements from a page, see the selenium documentation here.
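As a rough sketch of what some of those other strategies look like in the selenium 3 API used above (the selector strings below are hypothetical and depend entirely on the html of the site you are scraping):

# Hypothetical examples of other selenium 3 locator strategies; 'driver' is the
# webdriver.Firefox instance returned by first_page(), and the selector strings
# are placeholders to adapt to your target site's html.
next_link = driver.find_element_by_partial_link_text("Next")      # partial link text
next_button = driver.find_element_by_css_selector("a.next-page")  # CSS selector
next_anchor = driver.find_element_by_xpath("//a[@rel='next']")    # XPath expression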
Also note that you will need to download the packages this relies on, as well as the driver selenium needs to communicate with your browser (in this case geckodriver: download geckodriver, put it in a folder, and then point 'executable_path' at the executable).
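A quick sketch of the two usual ways to make selenium 3 find geckodriver (the path is a placeholder):

from selenium import webdriver

# Option 1: point selenium at the geckodriver executable explicitly (placeholder path).
driver = webdriver.Firefox(executable_path='C://example/geckodriver.exe')

# Option 2: add the folder containing geckodriver to your system PATH beforehand,
# in which case selenium finds it without any argument.
driver = webdriver.Firefox()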
如果確實最終使用了這些軟件包,它可以幫助使用時間包(原生爲python)分佈您的服務器請求,以避免超出允許的最大請求數服務器關閉你正在刮。我沒有最終需要它爲我自己的項目,但看到here,原始問題的第二個答案,對於在第四個代碼塊中使用時間模塊的實現示例。
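As a minimal sketch of what that could look like here, assuming a fixed pause is acceptable, a hypothetical throttled variant of nth_page() (the 5-second delay is a placeholder):

from time import sleep

def nth_page_throttled(driver, counter, max_iter, delay=5):
    """Hypothetical variant of nth_page() that pauses between requests;
    the delay (in seconds) is a placeholder, pick whatever the server tolerates."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))
        pageLink.click()
        scrape_page(driver)  # reuse the scrape_page() function defined above
        sleep(delay)         # spread out requests to stay under the server's limit
        counter += 1
    print("Done scraping")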
Yeeeeaaaahhh... if someone with higher rep could edit in some links to the beautifulsoup, selenium and time documentation, that would be great, thaaaanks.