2015-06-19 96 views

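Scrolling a page slowly with Selenium

I'm trying to scrape some data from a flight search page.

The page works like this:

You fill in a form and click the search button; that part is fine. When you click the button, you are redirected to a page with results, and this is where the problem lies. The page keeps adding results for about a minute, which is not a big deal; the problem is getting all of those results. In a real browser you have to scroll down the page before these results appear. So I tried to scroll down with Selenium. It reaches the bottom of the page so quickly (or perhaps it jumps instead of scrolling) that the page doesn't load any new results.

When you scroll down slowly, the page loads more results, but if you scroll very quickly, it stops loading.

I'm not sure whether my code helps in understanding this, so I'm attaching it.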

import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

import mLib  # the asker's own helper module (not shown here)

SEARCH_STRING = """URL"""


class spider():

    def __init__(self):
        self.driver = webdriver.Firefox()

    @staticmethod
    def prepare_get(dep_airport, arr_airport, dep_date, arr_date):
        return SEARCH_STRING % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)

    def find_flights_html(self, dep_airport, arr_airport, dep_date, arr_date):
        if isinstance(dep_airport, list):
            dep_airport = '%20'.join(dep_airport)

        wait = WebDriverWait(self.driver, 60)  # wait for results
        self.driver.get(spider.prepare_get(dep_airport, arr_airport, dep_date, arr_date))
        wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
        wait.until(EC.invisibility_of_element_located((By.XPATH, u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))
        # jump straight to the bottom of the page
        self.driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")

        self.driver.find_element(By.XPATH, '//body').send_keys(Keys.CONTROL + Keys.END)
        return self.driver.page_source

    @staticmethod
    def get_info_from_borderbox(div):
        arrival = div.find('div', class_='departure').text
        price = div.find('div', class_='pricebox').find('div', class_=re.compile('price'))
        departure = div.find_all('div', class_='departure')[1].contents
        date_departure = departure[1].text
        airport_departure = departure[5].text
        arrival = div.find_all('div', class_='arrival')[0].contents
        date_arrival = arrival[1].text
        airport_arrival = arrival[3].text[1:]
        print('DEPARTURE: ')
        print(date_departure, airport_departure)
        print('ARRIVAL: ')
        print(date_arrival, airport_arrival)

    @staticmethod
    def get_flights_from_result_page(html):

        def match_tag(tag, classes):
            # a div whose class attribute contains every name in `classes`
            return (tag.name == 'div'
                    and 'class' in tag.attrs
                    and all([c in tag['class'] for c in classes]))

        soup = mLib.getSoup_html(html)
        divs = soup.find_all(lambda t: match_tag(t, ['borderbox', 'flightbox', 'p2']))

        for div in divs:
            spider.get_info_from_borderbox(div)

        print(len(divs))


spider_inst = spider()

spider.get_flights_from_result_page(spider_inst.find_flights_html(['BTS', 'BRU', 'PAR'], 'MAD', '2015-07-15', '2015-08-15'))

So the main problem, as I see it, is that it scrolls too fast to trigger the loading of new results.

Do you have any idea how to make it work?
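
The slow scrolling described in the question could be approximated by scrolling a fixed number of pixels at a time and pausing between steps. The following is only a minimal sketch under assumptions: the name scroll_slowly, the step size, the pause and the max_steps guard are illustrative and not from the thread, and the driver is assumed to offer execute_script the way a Selenium WebDriver does. The bottom-of-page check can fire too early if the page adds content after the pause, so the values would need tuning for a real site.

```python
import time


def scroll_slowly(driver, step=300, pause=0.5, max_steps=1000):
    """Scroll down `step` pixels at a time, pausing after each step.

    Hypothetical helper: `driver` is anything with a Selenium-style
    execute_script(script, *args) method. Returns the number of steps taken.
    """
    steps = 0
    while steps < max_steps:
        # Move the viewport down by a small amount instead of jumping.
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        steps += 1
        # Give lazily loaded content a chance to be requested and rendered.
        time.sleep(pause)
        # Stop once the viewport has reached the current bottom of the page.
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight - 1;")
        if at_bottom:
            break
    return steps
```

With a real browser this might be called as scroll_slowly(self.driver) in place of the single window.scrollTo call, before reading page_source.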

Answers

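Here is a different approach that worked for me. It scrolls the last search result into view and waits for additional elements to load before scrolling again: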

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC


class wait_for_more_than_n_elements(object):
    """Custom expected condition: true once more than `count` elements match `locator`."""

    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            count = len(driver.find_elements(*self.locator))
            # strictly greater, so the wait holds until new results appear
            return count > self.count
        except StaleElementReferenceException:
            return False


driver = webdriver.Firefox()

dep_airport = ['BTS', 'BRU', 'PAR']
arr_airport = 'MAD'
dep_date = '2015-07-15'
arr_date = '2015-08-15'

# several departure airports are joined with an encoded space
dep_airport = '%20'.join(dep_airport)

url = "https://www.pelikan.sk/sk/flights/list?dfc=C%s&dtc=C%s&rfc=C%s&rtc=C%s&dd=%s&rd=%s&px=1000&ns=0&prc=&rng=1&rbd=0&ct=0" % (dep_airport, arr_airport, arr_airport, dep_airport, dep_date, arr_date)
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 60)
wait.until(EC.invisibility_of_element_located((By.XPATH, '//img[contains(@src, "loading")]')))
wait.until(EC.invisibility_of_element_located((By.XPATH,
                                               u'//div[. = "Poprosíme o trpezlivosť, hľadáme pre Vás ešte viac letov"]/preceding-sibling::img')))

# TODO: make the endless loop end (as written, wait.until raises
# TimeoutException once no new results arrive within 60 seconds)
while True:
    results = driver.find_elements(By.CSS_SELECTOR, "div.flightbox")
    print("Results count: %d" % len(results))

    # scroll to the last element
    driver.execute_script("arguments[0].scrollIntoView();", results[-1])

    # wait for more results to load
    wait.until(wait_for_more_than_n_elements((By.CSS_SELECTOR, 'div.flightbox'), len(results)))

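Notes:

  • You need to decide when to stop the loop, for example at a particular len(results) value
  • wait_for_more_than_n_elements is a custom Expected Condition that helps to identify when the next batch of results has loaded and we can scroll again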
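
One possible way to end the loop mentioned in the first note is to stop as soon as a scroll brings no new results within a timeout, or once a chosen maximum is reached. The sketch below is an illustration under assumptions, not the answerer's code: load_all_results and its parameters are hypothetical names, the waiting is done with a plain polling loop rather than WebDriverWait, and the driver only needs Selenium-style find_elements and execute_script methods.

```python
import time


def load_all_results(driver, css_selector="div.flightbox",
                     timeout=60.0, poll=0.5, max_results=None):
    """Keep scrolling to the last result until no new ones appear.

    Hypothetical helper; returns the number of results found at the end.
    """
    by = "css selector"  # the string value of selenium's By.CSS_SELECTOR
    results = driver.find_elements(by, css_selector)
    while results and (max_results is None or len(results) < max_results):
        # Scroll the last result into view to trigger the next batch.
        driver.execute_script("arguments[0].scrollIntoView();", results[-1])

        # Poll until the number of results grows or the timeout expires.
        deadline = time.monotonic() + timeout
        latest = results
        while len(latest) <= len(results) and time.monotonic() < deadline:
            time.sleep(poll)
            latest = driver.find_elements(by, css_selector)

        if len(latest) <= len(results):
            break  # nothing new arrived in time: assume the list is complete
        results = latest
    return len(results)
```

After the function returns, the results can be extracted in one pass, for instance from driver.page_source.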

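I'm afraid it doesn't work. It returns 10 on every pass through the loop, and when I tried adding "for result in results: print result.text" I found that it returns the same values. –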

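@Milan Well, I see the results count increasing with every iteration of the loop, which means that additional results are being loaded. Extract the results after the loop ends. – alecxe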

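To check whether it is finding new results, I added the results to a set and printed the length of the set on each pass through the loop. It stays at 15. Here you can find the code and the printed results: http://pastebin.com/fkUrCvAm –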
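
The check described in the last comment can be sketched roughly as follows. This is not the code from the pastebin link; add_new_texts is a hypothetical helper, and it assumes each result object exposes a text attribute, as Selenium's WebElement does. Counting distinct texts only detects new results if every result's text is unique.

```python
def add_new_texts(seen, elements):
    """Add each element's text to the set `seen`; return how many were new."""
    before = len(seen)
    for element in elements:
        seen.add(element.text)
    return len(seen) - before
```

Inside the scrolling loop, one might keep seen = set() outside the loop and print add_new_texts(seen, results) together with len(seen) on each iteration; a value that stays at zero suggests the page is not delivering new results.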