Scraping websites with infinite scrolling using Python

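I have been doing some research, and so far I have found the Python package Scrapy, which I plan to use. Now I am trying to find a good way to build a scraper with Scrapy that can crawl websites with infinite scrolling. After digging around, I found a package called Selenium, which has a Python module. I have a feeling that someone has already used Scrapy and Selenium together to scrape a site with infinite scrolling. It would be great if someone could point me to an example.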

2014-03-28 Null-Hypothesis

One approach is to send a few down-arrow key presses so that the browser scrolls down. – donfuxx

Take a look at: http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page – alecxe

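from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.something.com")
# find_elements_by_id was removed in Selenium 4; use find_elements(By.ID, ...)
last_element = driver.find_elements(By.ID, "someId")[-1]
# Sending a no-op key scrolls the element into view
last_element.send_keys(Keys.NULL)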

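This opens a page, finds the bottom-most element with the given id, and scrolls that element into view. Each time the page loads more content, you have to query the driver again for the new last element, and I have found that this gets slow as the page grows. Most of the time is spent in the driver.find_element_* call, because I don't know of a way to query only the last element in the page.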

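By experimenting you may find that there is an upper limit on the number of elements the page loads dynamically. It would be best to write something that loads that many elements first and only then makes a single call to driver.find_element_*.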
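One way to sketch that idea: count the elements cheaply in the browser with JavaScript, keep scrolling until the count reaches a target, and only then do the expensive element lookup once. The function name, the CSS selector, the target count and the pause length are all assumptions for illustration, not something from the original answer; `driver` is expected to be a Selenium WebDriver (only its `execute_script` method is used).

```python
import time


def scroll_until_count(driver, css_selector, target, pause=1.0, max_rounds=50):
    """Scroll down until at least `target` elements match, or give up.

    Returns the last element count reported by the page.
    """
    # Counting in the browser avoids building a WebElement for every match.
    count_js = "return document.querySelectorAll(arguments[0]).length;"
    count = driver.execute_script(count_js, css_selector)
    for _ in range(max_rounds):
        if count >= target:
            break
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        count = driver.execute_script(count_js, css_selector)
    return count
```

With a real browser you would call something like `scroll_until_count(driver, ".item", 500)` (the selector `.item` is hypothetical) and then fetch the elements once with `driver.find_elements(...)`.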

2014-04-14 20:09:05 maxywb

You can use Selenium to scrape infinite-scrolling websites like Twitter or Facebook.

Step 1: Install Selenium using pip

pip install selenium

Step 2: Use the code below to automate the infinite scrolling and extract the page source

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.support.ui import Select 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.common.exceptions import TimeoutException 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import NoSuchElementException 
from selenium.common.exceptions import NoAlertPresentException 
import sys 

import unittest, time, re 

class Sel(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
        self.base_url = "https://twitter.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def tearDown(self):
        # Close the browser when the test finishes
        self.driver.quit()

    def test_sel(self):
        driver = self.driver
        driver.get(self.base_url + "/search?q=stackoverflow&src=typd")
        # Selenium 4 replaced find_element_by_link_text with find_element(By.LINK_TEXT, ...)
        driver.find_element(By.LINK_TEXT, "All").click()
        # Scroll to the bottom repeatedly so the page keeps loading more results
        for i in range(1, 100):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)  # wait for the next batch to load
        html_source = driver.page_source
        data = html_source.encode('utf-8')


if __name__ == "__main__": 
    unittest.main()

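The for loop scrolls through the infinitely scrolling page so that more posts get loaded, after which you can extract the loaded data.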
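A fixed number of scrolls can either stop too early or keep sleeping long after the page has run out of content. As a rough alternative, here is a sketch that stops once `document.body.scrollHeight` stops growing. The function name and its defaults are my own assumptions, and a slow network can make the height look unchanged for a moment, so the pause may need tuning.

```python
import time


def scroll_to_bottom(driver, pause=4.0, max_rounds=100):
    """Scroll until document.body.scrollHeight stops growing.

    Returns the number of scrolls performed.
    """
    height_js = "return document.body.scrollHeight;"
    last_height = driver.execute_script(height_js)
    scrolls = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # wait for the next batch to load
        new_height = driver.execute_script(height_js)
        if new_height == last_height:
            break  # nothing new was loaded, assume we reached the end
        last_height = new_height
    return scrolls
```

Inside a test like the one above, this would replace the for loop with a single call such as `scroll_to_bottom(self.driver)`.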

Step 3: Print the data if required.
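What "the data" means depends on the site, so this is only an illustration: a minimal sketch that pulls every link out of the page source using the standard library's `html.parser`. The class and function names are made up for this example, and in practice you would probably target whatever elements hold the posts you care about (or hand the HTML to Scrapy's selectors instead).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_source):
    """Return all href values found in `html_source` (a str, not bytes)."""
    parser = LinkCollector()
    parser.feed(html_source)
    parser.close()
    return parser.links
```

You would pass it the text from `driver.page_source` (the `str`, not the UTF-8 encoded bytes) and print each entry, e.g. `for link in extract_links(html_source): print(link)`.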

2014-11-08 06:11:23
