Selenium WebDriver Python - page load incomplete / sometimes freezes on refresh

I am scraping a website that contains a lot of JavaScript that is generated when the page is called. As a result, traditional web-scraping methods (BeautifulSoup, etc.) are not working for my purposes (at least, I have not been able to get them to work; all the important data is in the JavaScript portions). So I started using Selenium WebDriver. I need to scrape a few hundred pages, each of which has between 10 and 80 data points (each with about 12 fields), so it is important that this script (is that the right term?) can run for quite a while without my babysitting it.
I have the code working for a single page, and I have a controlling section that tells the scraping section what page to scrape. The problem is that sometimes the JavaScript portions of the page load, and sometimes they don't (~1/7 of the time). Refreshing fixes things, but sometimes the refresh freezes the WebDriver, and with it the Python runtime environment as well. Annoyingly, when it freezes like this, the code fails to time out. What is going on?
Here is a stripped-down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple

def main(url_full):
    driver = webdriver.Firefox()
    driver.implicitly_wait(15)
    driver.set_page_load_timeout(30)

    # create HealthPlan namedtuple
    HealthPlan = namedtuple("HealthPlan",
        ("State, County, FamType, Provider, PlanType, Tier,"
         " Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,"
         " GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))

    # check whether the page has loaded and handle page load and timeout errors
    pageNotLoaded = True
    while pageNotLoaded:
        try:
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        except TimeoutException:
            driver.quit()
            time.sleep(3 + abs(random.normalvariate(1.8, 3)))
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        # Handle page load error by testing presence of showAll,
        # an important feature of the page, which only appears if everything else loads
        try:
            driver.find_element_by_xpath('//*[@id="showAll"]').text
        # catch NoSuchElementException => refresh page
        except NoSuchElementException:
            try:
                driver.refresh()
            # catch TimeoutException => quit and load the page
            # in a new instance of Firefox.
            # I don't think the code ever gets here, because it freezes in the
            # refresh and will not throw the TimeoutException like I would like
            except TimeoutException:
                driver.quit()
                time.sleep(3 + abs(random.normalvariate(1.8, 3)))
                driver.get(url_full)
                time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        pageNotLoaded = False

    scrapePage()  # this is a dummy function, everything from here down works fine
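For what it's worth, the structure I am aiming for is roughly this: never call `driver.refresh()` at all, and instead throw the whole driver away and start a fresh one on any failure, so a wedged Firefox process can never block the script. Here is a minimal sketch of that retry loop, with the Selenium-specific steps abstracted behind callables — `make_driver`, `load_page`, and `page_ready` are hypothetical stand-ins for `webdriver.Firefox()`, `driver.get(url_full)`, and the `showAll` check:

```python
import time

def fetch_with_fresh_driver(make_driver, load_page, page_ready,
                            max_attempts=3, pause=3.0):
    """Retry page loads, discarding the whole driver on every failure.

    make_driver() -> driver, load_page(driver) (may raise), and
    page_ready(driver) -> bool are hypothetical callables standing in
    for webdriver.Firefox(), driver.get(url), and an element check.
    refresh() is never called, so a frozen browser cannot hang us.
    """
    for attempt in range(max_attempts):
        driver = make_driver()
        try:
            load_page(driver)            # e.g. driver.get(url_full)
            if page_ready(driver):       # e.g. showAll element is present
                return driver            # caller scrapes, then quits
        except Exception:                # TimeoutException, etc.
            pass
        driver.quit()                    # throw away the (possibly wedged) instance
        time.sleep(pause)                # brief pause before a clean retry
    raise RuntimeError("page never loaded after %d attempts" % max_attempts)
```

Whether this actually sidesteps the freeze depends on `quit()` succeeding against a hung Firefox, which I have not verified.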
I have looked widely for similar problems, and I don't think anyone has posted about this on SO, or anywhere else that I have looked. I am using Python 2.7 and Selenium 2.39.0, and I am trying to scrape Healthcare.gov's premium-estimate pages.
EDIT: (for example, this page) It is probably also worth mentioning that the page fails to load more often when the computer has been running/doing this for a while (I'm guessing free memory fills up and it glitches while loading), but this is somewhat beside the point, because either way this should be handled by the try/except.
EDIT2: I should also mention that this is running on 64-bit Windows 7 with Firefox 17 (which I believe is the latest supported version).