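How to use Scrapy and Splash to scrape AJAX-based websites?

I want to build a generic scraper that can crawl and scrape all the data from any type of website, including AJAX websites. I have searched the internet extensively but could not find any proper link that explains how Scrapy and Splash together can scrape AJAX websites (including pagination, form data, and clicking a button before the page is displayed). Every link I have found says that JavaScript websites can be rendered using Splash, but there is no good tutorial or explanation of how to use Splash to render JS websites. Please don't give me solutions that involve using browsers (I want to do everything programmatically; headless browser suggestions are welcome, but I want to use Splash).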

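import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = ["www.amazon.com"]

    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=mobile']

    # The rules tuple needs its closing parenthesis and a trailing comma.
    rules = (Rule(LinkExtractor(), callback='lol', follow=True),)

    def parse_start_url(self, response):
        # Request the start URL again through Splash so its JavaScript is rendered.
        yield scrapy.Request(response.url, self.lol, meta={'splash': {'endpoint': 'render.html', 'args': {'wait': 5, 'iframes': 1}}})

    def lol(self, response):
        """
        Some code
        """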

Source

2017-06-08 Rohan

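Did you follow the Splash docs (https://github.com/scrapy-plugins/scrapy-splash#installation)? What exactly is your question?

Yes, I did. The Splash docs only mention the commands we can use. I want to know how to use them to run a website's JavaScript and get the dynamic content... – Rohan

Well, if you don't have a Splash-specific question or problem, I'm not going to copy-paste the docs... If you follow the documentation, you should be able to scrape JavaScript-based websites.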

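You can emulate a behavior, such as a click or a scroll, by writing a JavaScript function and telling Splash to execute that script when it renders your page.

A small example:

You define a JavaScript function that selects an element on the page and then clicks on it:

(Source: Splash docs)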

# Get button element dimensions with JavaScript and perform a mouse click.
_script = """ 
function main(splash) 
    assert(splash:go(splash.args.url)) 
    local get_dimensions = splash:jsfunc([[ 
     function() { 
      var rect = document.getElementById('button').getClientRects()[0]; 
      return {"x": rect.left, "y": rect.top} 
     } 
    ]]) 
    splash:set_viewport_full() 
    splash:wait(0.1) 
    local dimensions = get_dimensions() 
    splash:mouse_click(dimensions.x, dimensions.y) 

    -- Wait split second to allow event to propagate. 
    splash:wait(0.1) 
    return splash:html() 
end 
"""

Then, when you make the request, you change the endpoint to "execute" and add "lua_source": _script to the args.

Example:

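from scrapy_splash import SplashRequest

def parse(self, response):
    # Send the page to Splash's "execute" endpoint so the Lua script above runs.
    yield SplashRequest(response.url, self.parse_elem,
                        endpoint="execute",
                        args={"lua_source": _script})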

You will find all the information about Splash scripting in the Splash documentation.
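
The scrolling case mentioned at the start of this answer can be handled the same way. A minimal sketch, assuming the page loads more content whenever it is scrolled to the bottom; the argument names `scrolls` and `pause` and the helper `scroll_args` are made up for this example.

```python
# Lua script for the "execute" endpoint: scroll to the bottom several times,
# pausing after each scroll so pending AJAX requests have time to finish.
SCROLL_SCRIPT = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.pause)
    for _ = 1, splash.args.scrolls do
        splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
        splash:wait(splash.args.pause)
    end
    return splash:html()
end
"""


def scroll_args(scrolls=3, pause=1.0):
    """Build the args dict for a Splash request (hypothetical helper)."""
    if scrolls < 1 or pause <= 0:
        raise ValueError("scrolls must be >= 1 and pause must be > 0")
    return {"lua_source": SCROLL_SCRIPT, "scrolls": scrolls, "pause": pause}


args = scroll_args(scrolls=5, pause=0.5)
```

The result would be passed as `SplashRequest(url, callback, endpoint="execute", args=args)`. Fixed pauses are a crude heuristic, so the values probably need tuning per site.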

Source

2017-06-08 13:31:56

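Thanks! Great explanation. I'd like to know whether we can execute all the JavaScript on a web page using Scrapy + Splash? – Rohan

The problem with Splash and pagination is the following:

I wasn't able to produce a Lua script that delivers the new page (after clicking the pagination link) in the form of a response rather than as pure HTML.

So my solution is the following: click the link, extract the newly generated URL, and point the crawler at this new URL.

So, on a page that has a pagination link, I execute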

yield SplashRequest(url=response.url, callback=self.get_url, endpoint="execute", args={'lua_source': script})

with the following Lua script

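def parse_categories(self, response):
    # Click the "next page" link inside Splash and return the resulting URL.
    script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(1)
            splash:runjs('document.querySelectorAll(".next-page")[0].click()')
            splash:wait(1)
            return splash:url()
        end
    """
    # The SplashRequest shown above is yielded from here.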

and the get_url function

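def get_url(self, response):
    # The response body is the URL returned by the Lua script.
    # response.text replaces the deprecated response.body_as_unicode().
    yield SplashRequest(url=response.text, callback=self.parse_categories)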

This way I was able to loop through my queries.
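
One detail the loop above leaves open is when to stop. A possible guard, assuming that on the last page the click has no effect and Splash therefore reports the same URL again, is to compare the returned URL with the one that was requested. `next_page_url` is a hypothetical helper, and whether the URL really stays unchanged depends on the site.

```python
from urllib.parse import urljoin


def next_page_url(current_url, returned_url):
    """Return the absolute URL of the next page, or None if the click did
    not navigate anywhere (hypothetical helper for illustration)."""
    # Strip stray whitespace and resolve relative URLs against the current page.
    candidate = urljoin(current_url, returned_url.strip())
    return None if candidate == current_url else candidate


first = next_page_url("https://example.com/list?page=1",
                      "https://example.com/list?page=2\n")
last = next_page_url("https://example.com/list?page=2",
                     "https://example.com/list?page=2")
```

In `get_url` this could be called as `next_page_url(response.url, response.text)`, yielding a new `SplashRequest` only when the result is not `None`.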

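In the same way, if you don't expect a new URL, your Lua script can return pure HTML, which you then have to process with regular expressions (which is bad), but this is the best I was able to do.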
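
As a possible alternative to regular expressions, a Lua script can return a table such as `{url = splash:url(), html = splash:html()}`, which Splash serializes as JSON. The sketch below assumes that shape and uses only the standard library to pull links out of the returned HTML. `LinkCollector` and `links_from_splash_json` are hypothetical names, and the sample payload is invented for the example.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_from_splash_json(body):
    """Parse a JSON body of the form {"url": ..., "html": ...} and return
    the absolute URLs of the links found in the HTML."""
    data = json.loads(body)
    collector = LinkCollector()
    collector.feed(data["html"])
    return [urljoin(data["url"], href) for href in collector.hrefs]


# Invented sample payload standing in for a real Splash response.
sample = json.dumps({
    "url": "https://example.com/list?page=2",
    "html": '<a class="item" href="/item/1">One</a><a href="item/2">Two</a>',
})
links = links_from_splash_json(sample)
```

Within Scrapy, scrapy-splash exposes the decoded JSON of such a response as `response.data`, so Scrapy's own selectors could be applied to the `html` value instead of a hand-written parser.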

Source

2017-07-05 21:43:59
