Adding a wait-for-element condition when making a SplashRequest in Python Scrapy

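I want to use Splash for Scrapy in Python to scrape some dynamic websites. However, I have found that in some cases Splash does not wait for the full page to load. A brute-force way around this is to add a large wait time (for example, 5 seconds in the snippet below). This is very inefficient, though, and still fails to load some data (the content sometimes takes more than 5 seconds to load). Is there some kind of wait-for-element condition that can be used with these requests?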

yield SplashRequest(
    url,
    self.parse,
    args={'wait': 5},
    headers={
        'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36",
    },
)


2016-12-10 NightFury13

Yes, you can write a Lua script to do this. Something like:

function main(splash) 
    splash:set_user_agent(splash.args.ua) 
    assert(splash:go(splash.args.url)) 

    -- requires Splash 2.3 
    while not splash:select('.my-element') do 
        splash:wait(0.1)
    end 
    return {html=splash:html()} 
end

Before Splash 2.3, you can use splash:evaljs('!document.querySelector(".my-element")') instead of not splash:select('.my-element').

Save this script in a variable (lua_script = """ ... """). You can then send a request like this:

yield SplashRequest(
    url, 
    self.parse, 
    endpoint='execute', 
    args={
        'lua_source': lua_script,
        'ua': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"
    }
)
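
The selector does not have to be hard-coded in the Lua source. As a minimal sketch, assuming the selector is passed as an extra argument and read back through splash.args (the argument name css and the helper build_splash_args are hypothetical names chosen for this example), the script and its args dict could look like this:

```python
# Lua source for Splash's "execute" endpoint. The selector is read from
# splash.args.css; "css" is a hypothetical argument name for this sketch.
LUA_WAIT_FOR_ELEMENT = """
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))

    -- requires Splash 2.3
    while not splash:select(splash.args.css) do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
"""

DEFAULT_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"
)


def build_splash_args(css, ua=DEFAULT_UA):
    """Return the dict to pass as SplashRequest(..., args=...)."""
    return {"lua_source": LUA_WAIT_FOR_ELEMENT, "css": css, "ua": ua}
```

Inside a spider this would be used as yield SplashRequest(url, self.parse, endpoint='execute', args=build_splash_args('.my-element')). Note that this loop has no upper bound, so it keeps polling if the element never appears.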

See the scripting tutorial and reference for details on how to write Splash Lua scripts.


2016-12-12 10:08:57

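To add to the solution: I ran into an "attempt to index a nil value" Lua error when running the script above. The problem is that ':exists()' cannot be called on the 'nil' value returned by 'splash:select('.my-element')' when the element has not been rendered yet. Simply removing the ':exists()' part and checking 'while not splash:select('.my-element') do' in the loop solved my problem. – NightFury13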

Good catch, @NightFury13! I am changing the example so that people who find this answer in the future do not run into this problem. –

I had a similar requirement, but with a timeout. My solution is a slight modification of the above:

function wait_css(splash, css, maxwait)
    if maxwait == nil then
        maxwait = 10  -- default maxwait if not given
    end

    local i = 0
    while not splash:select(css) do
        if i == maxwait then
            break  -- times out at maxwait secs
        end
        i = i + 1
        splash:wait(1)  -- each loop has duration 1 sec
    end
end
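
The helper above still needs a main function to call it. A minimal sketch of how the two might be wired together from Python follows, assuming the selector and the timeout are passed through splash.args; the argument names css and maxwait, the found field in the result, and the helper build_timeout_args are hypothetical and not part of the answer above:

```python
# Lua source combining the wait_css helper with a main function.
# splash.args.css and splash.args.maxwait are hypothetical argument names.
LUA_WAIT_WITH_TIMEOUT = """
function wait_css(splash, css, maxwait)
    if maxwait == nil then
        maxwait = 10  -- default maxwait if not given
    end

    local i = 0
    while not splash:select(css) do
        if i == maxwait then
            break  -- times out at maxwait secs
        end
        i = i + 1
        splash:wait(1)  -- each loop has duration 1 sec
    end
end

function main(splash)
    assert(splash:go(splash.args.url))
    wait_css(splash, splash.args.css, splash.args.maxwait)
    -- found is false when the wait timed out before the element appeared
    local found = splash:select(splash.args.css) ~= nil
    return {html=splash:html(), found=found}
end
"""


def build_timeout_args(css, maxwait=None):
    """Return the dict to pass as SplashRequest(..., args=...).

    When maxwait is None the key is omitted, so splash.args.maxwait is nil
    and the Lua helper falls back to its own default of 10 seconds.
    """
    args = {"lua_source": LUA_WAIT_WITH_TIMEOUT, "css": css}
    if maxwait is not None:
        args["maxwait"] = maxwait
    return args
```

A request would then be sent with endpoint='execute' and args=build_timeout_args('.my-element', maxwait=5). Returning found lets the callback distinguish a page that rendered the element from one that simply hit the timeout.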


2018-03-10 13:12:17 justint
