2016-10-22 68 views
5

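403 error when using rvest to log in to a website for scraping

I am trying to scrape a page on a website that requires a login, and I get a 403 error as soon as I submit the form.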

I adapted the code from these two posts for my site: "Using rvest or httr to log in to non-standard forms on a webpage" and "how to reuse a session to avoid repeated login when scraping with rvest?"

library(rvest) 
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1") 
pgform <- html_form(pgsession)[[1]] 
filled_form <- set_values(pgform, 'username'='user', 'password'='pass') 
s <- submit_form(pgsession, filled_form) # s is your logged in session 

When I run the code, I get this message:

Submitting with 'NULL' 
Warning message: 
In request_POST(session, url = url, body = request$values, encode = request$encode, : 
    Forbidden (HTTP 403). 

I also ran the code with an updated user_agent, as R.S. suggested in the comments. However, I receive the same error as above.

library(rvest) 
library(httr) 
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36" 
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring)) 
pgform <- html_form(pgsession)[[1]] 
filled_form <- set_values(pgform, 'username'='user', 'password'='pass') 
s <- submit_form(pgsession, filled_form) # s is your logged in session 

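If you load the page without logging in, it shows only part of the data table, with this text at the bottom right: "Earnings events recorded: 65"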

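Once logged in, the page shows all 65 events and the table is fully populated, which is what I want to download. I have all the code I need except for the login part.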

Thanks for your help.

+1

Shouldn't 'submit_form(pgsession, pgform)' be 'submit_form(pgsession, filled_form)'? –

+0

Have you tried setting or changing the user agent? Edit: you definitely need to call submit_form with filled_form, as @Chirayu said. –

+0

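@ChirayuChamoli, I have updated the question to fix the mistake you pointed out and to include the error message I receive. Thanks for pointing out my first mistake. – mks212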

Answers

4

Following R.S.'s suggestion, I logged in successfully with RSelenium.

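A quick note for Mac users running Chrome or PhantomJS: I am on El Capitan and had trouble getting the Mac to recognize the paths to the two bin files. Instead, I moved the bin files to /usr/local/bin, and they ran without problems.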
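A quick way to confirm that R can see the drivers after moving them is to ask for their resolved paths. This is only a minimal sketch: `find_drivers` is a hypothetical helper, and the executable names `chromedriver` and `phantomjs` are assumptions about what the bin files are called on your machine.

```r
# Hypothetical helper: report which driver executables R can find on the PATH.
# The default names are assumptions; adjust them to match your bin files.
find_drivers <- function(names = c("chromedriver", "phantomjs")) {
  paths <- Sys.which(names)  # an empty string means the executable was not found
  data.frame(
    driver = names,
    path = unname(paths),
    found = unname(nzchar(paths)),
    stringsAsFactors = FALSE
  )
}

print(find_drivers())
```

If `found` is FALSE for a driver, the directory holding it is probably not on the PATH that the R session sees.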

Here is the code to log in with Chrome:

library(RSelenium) 
RSelenium::startServer() 
remDr <- remoteDriver(browserName = "chrome") 
remDr$open() 
appURL <- 'https://www.optionslam.com/accounts/login/' 
remDr$navigate(appURL) 
remDr$findElement("id", "id_username")$sendKeysToElement(list("user")) 
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter')) 

appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1' 
remDr$navigate(appURL) 

This can also be done with PhantomJS:

library(RSelenium) 

pJS <- phantom() # start phantomjs 

appURL <- 'https://www.optionslam.com/accounts/login/' 
remDr <- remoteDriver(browserName = "phantomjs") 
remDr$open() 
remDr$navigate(appURL) 
remDr$findElement("id", "id_username")$sendKeysToElement(list("user")) 
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter')) 

appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1' 
remDr$navigate(appURL) 
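
Once the browser has loaded the logged-in page, the rendered HTML can be handed to rvest to extract the table. A minimal sketch, assuming `remDr` is the open driver from the code above. `extract_tables` is a hypothetical helper name, and the source does not say which table on the page holds the earnings events, so you may need to inspect the result.

```r
library(rvest)

# Hypothetical helper: parse an HTML string and return every <table> as a data frame.
extract_tables <- function(html) {
  page <- read_html(html)
  html_table(html_nodes(page, "table"))
}

# Usage with the Selenium session from above (requires a running browser):
# tables <- extract_tables(remDr$getPageSource()[[1]])

# Offline check with a tiny literal page (made-up values, for illustration only):
demo <- "<table><tr><th>Date</th><th>Move</th></tr><tr><td>2016-10-20</td><td>4.2</td></tr></table>"
tables <- extract_tables(demo)
print(tables[[1]])
```
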
+1

Glad to see you were able to solve the problem. –

0

Here is an answer that uses rvest and solves the problem for the original use case:

library(rvest)
library(httr)
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"

pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))

pgform <- html_form(pgsession)[[1]]

filled_form <- set_values(pgform,
                          username = 'un',
                          password = 'ps')

# s is your logged-in session
s <- submit_form(pgsession, filled_form, submit = NULL, config(referer = pgsession$url))

The request needs to know which page you came from (the referer [sic]):

config(referer = pgsession$url)
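
To see what that argument contributes, here is a small sketch that builds the same option offline, without contacting the site. `httr::config(referer = ...)` sets curl's referer option, and `httr::add_headers(Referer = ...)` is another way to send the same header. The literal login URL stands in for `pgsession$url` and is an assumption. That a missing Referer caused the 403 (for example through a server-side CSRF check) is also an assumption, not something confirmed above.

```r
library(httr)

# Assumed stand-in for pgsession$url.
login_url <- "https://www.optionslam.com/accounts/login/"

# Option 1: set curl's referer option, as in the answer above.
ref_config <- config(referer = login_url)

# Option 2: send the Referer header explicitly.
ref_header <- add_headers(Referer = login_url)

# Both are httr request objects; either can be passed in the `...` of submit_form().
print(ref_config$options$referer)
print(ref_header$headers[["Referer"]])
```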