Scrapy Shell and Scrapy Splash

2016-02-11 122 views 7 likes

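We have been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a container.

If we want to use Splash in a spider, we configure several required project settings and yield a Request specifying specific meta arguments: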

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN,
    }
})

This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?


2016-02-11 alecxe

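It is true that there is no 'DEFAULT_REQUEST_META' setting the way there is DEFAULT_REQUEST_HEADERS (http://doc.scrapy.org/en/latest/topics/settings.html?#std:setting-DEFAULT_REQUEST_HEADERS); that would be a nice addition. There is an open discussion about enabling Splash by default through the middleware (see https://github.com/scrapinghub/scrapy-splash/issues/11). Another option is to subclass the scrapy-splash middleware and force the settings there. Ideas are welcome at https://github.com/scrapinghub/scrapy-splash/issues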
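A minimal sketch of the "force the settings" idea from the comment above. It does not subclass the scrapy-splash middleware. Instead it swaps in a simpler technique: a separate, hypothetical downloader middleware (`DefaultSplashMetaMiddleware`, a name invented here) that fills in `request.meta['splash']` when a request does not already have it. The default arguments are assumptions, and whether this interacts cleanly with a given scrapy-splash version has not been verified.

```python
import copy


class DefaultSplashMetaMiddleware:
    """Hypothetical downloader middleware that adds default Splash meta."""

    # Assumed defaults for illustration; adjust to the rendering you need.
    default_splash_meta = {"args": {"html": 1}, "endpoint": "render.json"}

    def process_request(self, request, spider):
        # Only fill in the 'splash' key when the request does not set it,
        # so per-request settings always win over the defaults.
        if "splash" not in request.meta:
            request.meta["splash"] = copy.deepcopy(self.default_splash_meta)
        # Returning None tells Scrapy to keep processing the request
        # through the remaining downloader middlewares.
        return None
```

To take effect it would have to be listed in DOWNLOADER_MIDDLEWARES with an order number lower than the Splash middleware's, so that it runs first; the module path used for that entry depends on your project.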

Answer

Just wrap the URL you want to open in the shell in the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

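where localhost:port is where your Splash service is running
url is the URL you want to crawl; don't forget to urlquote it!
render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
timeout is the request timeout, in seconds
wait is the time, in seconds, to wait for JavaScript to execute before the HTML is read/saved.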
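Since the target URL has to be urlquoted, here is a minimal sketch of building the wrapped URL in Python instead of by hand. The helper name `splash_render_url` and its defaults are made up for illustration; the endpoint and the `timeout`/`wait` arguments are the ones from the command above.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      endpoint="render.html", **render_args):
    """Build a Splash HTTP API URL that renders target_url (hypothetical helper)."""
    # urlencode percent-escapes the target URL, so its own '?', '&' and '='
    # are not mistaken for Splash arguments.
    query = urlencode({"url": target_url, **render_args})
    return f"{splash_base}/{endpoint}?{query}"


if __name__ == "__main__":
    # Print a URL that can be pasted into: scrapy shell '<printed url>'
    print(splash_render_url(
        "http://domain.com/page-with-javascript.html?a=1&b=2",
        timeout=10, wait=0.5,
    ))
```

The quoting matters as soon as the page URL has its own query string: without it, `&b=2` in the example would be read as an argument to Splash rather than as part of the page URL.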


2016-02-12 09:54:20 Granitosaurus

You can make a bash alias to make this more convenient. – Granitosaurus