
Selenium Webdriver Python - page loads incompletely / sometimes freezes on refresh

I am scraping a site where most of the content is generated by JavaScript when the page is called. Traditional scraping approaches (BeautifulSoup, etc.) therefore don't work for my purposes (at least, I could not get them to work; all the important data is in the JavaScript-generated part), so I started using selenium webdriver. I need to scrape several hundred pages, each with 10 to 80 data points (each data point having about 12 fields), so it is important that this script (is that the right term?) can run for a long stretch without me babysitting it.

I have the code working for a single page, and I have a control section that tells the scraping section what to scrape. The problem is that the JavaScript portions of the page sometimes load and sometimes don't (roughly 1 in 7 page loads). A refresh usually fixes this, but occasionally the refresh freezes the webdriver, and with it the Python runtime environment. Maddeningly, when it freezes like this, the code never times out. What is going on?

Here is a stripped-down version of my code:

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.support.ui import Select 
from selenium.common.exceptions import NoSuchElementException, TimeoutException 
import time, re, random, csv 
from collections import namedtuple 

def main(url_full):
    driver = webdriver.Firefox()
    driver.implicitly_wait(15)
    driver.set_page_load_timeout(30)

    # create HealthPlan namedtuple
    HealthPlan = namedtuple("HealthPlan",
        ("State, County, FamType, Provider, PlanType, Tier,") +
        (" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
        (" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))

    # check whether the page has loaded, and handle page-load and timeout errors
    pageNotLoaded = True
    while pageNotLoaded:
        try:
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        except TimeoutException:
            driver.quit()
            time.sleep(3 + abs(random.normalvariate(1.8, 3)))
            driver = webdriver.Firefox()  # new instance of firefox
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        # Handle page-load errors by testing for the presence of showAll,
        # an important feature of the page, which only appears if everything else loads
        try:
            driver.find_element_by_xpath('//*[@id="showAll"]').text
        # catch NoSuchElementException => refresh the page
        except NoSuchElementException:
            try:
                driver.refresh()
            # catch TimeoutException => quit and load the page
            # in a new instance of firefox.
            # I don't think the code ever gets here, because it freezes in the refresh
            # and never throws the timeout exception like I would like
            except TimeoutException:
                driver.quit()
                time.sleep(3 + abs(random.normalvariate(1.8, 3)))
                driver = webdriver.Firefox()  # new instance of firefox
                driver.get(url_full)
                time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        pageNotLoaded = False

    scrapePage()  # this is a dummy function; everything from here down works fine
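
For reference, here is a minimal sketch (not my actual code) of how the same showAll check could be expressed with an explicit WebDriverWait and a bounded number of fresh Firefox instances, so that a hung refresh() can never freeze the whole run:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import WebDriverException

def load_with_retries(url_full, attempts=3):
    # Launch a fresh Firefox per attempt instead of calling refresh(),
    # so a hung refresh cannot take the whole run down with it.
    for attempt in range(attempts):
        driver = webdriver.Firefox()
        driver.set_page_load_timeout(30)
        try:
            driver.get(url_full)
            # poll for showAll (the same load check as above) instead of sleeping
            WebDriverWait(driver, 20).until(
                lambda d: d.find_element_by_xpath('//*[@id="showAll"]'))
            return driver
        except WebDriverException:  # TimeoutException is a subclass
            driver.quit()
    raise RuntimeError("page did not load after %d attempts" % attempts)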

I have looked extensively for similar problems, and I don't think anyone has posted about this on SO or anywhere else I have looked. I am using python 2.7 with selenium 2.39.0, and I am trying to scrape the premium-estimate pages of Healthcare.gov (for example, this page).

EDIT: It may also be worth mentioning that the failed page loads become more common when the computer has been running/doing this for a while (I am guessing free memory fills up and the load glitches), but that is somewhat beside the point, because it should be handled by the try/except.

EDIT2: I should also mention that this is running on Windows 7 64-bit, with Firefox 17 (which I believe is the latest supported version).

Answers


Dude, time.sleep is a fail!

What is this?

time.sleep(3+ abs(random.normalvariate(1.8,3))) 

Try this:

import unittest

class TestPy(unittest.TestCase):

    def waits(self):
        self.implicit_wait = 30

Or this:

(self.)driver.implicitly_wait(10) 

Or this:

from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('some_xpath'))
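
For example, using the expected_conditions helpers that ship with selenium, the wait for the showAll element from the question could look like this (a sketch, not tested against that page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# blocks until the element is present in the DOM,
# or raises TimeoutException after 10 seconds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "showAll")))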

Or, instead of driver.refresh(), you can trick it with:

driver.get(your url) 

You can also clear the cookies:

driver.delete_all_cookies() 


As for scrapePage() ("this is a dummy function, everything from here down works fine"), have a look at:

http://scrapy.org
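
For instance, a bare-bones Scrapy spider skeleton might look like this (the URL and selectors are placeholders, and note that Scrapy alone does not execute JavaScript, so you would point it at the data endpoints behind the page rather than the rendered HTML):

import scrapy

class HealthPlanSpider(scrapy.Spider):
    name = "healthplans"
    # placeholder URL -- point this at the underlying data endpoint
    start_urls = ["https://www.healthcare.gov/find-premium-estimates/"]

    def parse(self, response):
        # placeholder selector: pull the text of each table cell per row
        for row in response.xpath('//table//tr'):
            yield {"fields": row.xpath('.//td/text()').extract()}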
