2014-02-10 91 views
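Handling htmlParse error (failed to load HTTP resource)

I am trying to scrape a set of web pages. However, my loop breaks from time to time because the parser "failed to load HTTP resource". The page in question does not load in my browser either, so it is not a problem with the code.

Still, adding an exception for each such page and restarting the process every time an error turns up is very tedious. I would like to know whether this can be handled with an if condition. What I have in mind is: if an error occurs, skip to the next iteration of the loop.

I opened the help page for htmlParse and saw that it has an error argument, but I could not work out how to use it. Any ideas for my if condition?

Here is a reproducible example: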

if (!require(RCurl)) install.packages('RCurl')
if (!require(XML)) install.packages('XML')
if (!require(seqinr)) install.packages('seqinr')

listofTabelas <- list() # collects one table per page

for (i in 575:585){
    currentPage <- i # defines the starting page of the search
    # Link that will be looked up
    link <- paste("http://www.cnj.jus.br/improbidade_adm/visualizar_condenacao.php?seq_condenacao=",
                  currentPage,
                  sep='')

    doc <- htmlParse(link, encoding = "UTF-8") # this will preserve characters
    tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
    if(length(tables) != 0) {
        tabela2 <- as.data.frame(tables[10])

        # replace newlines and tabs with spaces
        tabela2[,1] <- gsub("\\n", " ", tabela2[,1])
        tabela2[,2] <- gsub("\\n", " ", tabela2[,2])
        tabela2[,2] <- gsub("\\t", " ", tabela2[,2])

        listofTabelas[[i]] <- tabela2

        tabela1 <- do.call("rbind", listofTabelas)
        names(tabela1) <- c("Variaveis", "status")
    }
}

Answer

You would probably be better off using the httr package.

library(httr)
library(XML)

url <- "http://www.cnj.jus.br/improbidade_adm/visualizar_condenacao.php"
for (i in 575:585){
    # request the page; the page number goes into the query string
    response <- GET(url, query = list(seq_condenacao = i))
    if (response$status_code != 200){ # HTTP request failed!!
        # do some stuff...
        print(paste("Failure:", i, "Status:", response$status_code))
        next # skip to the next page
    }
    doc <- htmlParse(response, encoding = "UTF-8")
    # do some other stuff
    print(paste("Success:", i, "Status:", response$status_code))
}
# [1] "Success: 575 Status: 200" 
# [1] "Success: 576 Status: 200" 
# [1] "Success: 577 Status: 200" 
# [1] "Success: 578 Status: 200" 
# [1] "Success: 579 Status: 200" 
# [1] "Success: 580 Status: 200" 
# [1] "Success: 581 Status: 200" 
# [1] "Success: 582 Status: 200" 
# [1] "Success: 583 Status: 200" 
# [1] "Success: 584 Status: 200" 
# [1] "Success: 585 Status: 200" 
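
The question's own idea ("if an error occurs, move on to the next page") can also be sketched with base R's tryCatch. This is only an illustration, not part of the answer above: `try_or_null` is a hypothetical helper name, and the anonymous function stands in for `htmlParse(link)` so that the example runs without network access.

```r
# Hypothetical helper (not part of XML or httr): call f(...) and return NULL
# instead of stopping the script when the call signals an error.
try_or_null <- function(f, ...) {
  tryCatch(f(...), error = function(e) {
    message("Skipping: ", conditionMessage(e))
    NULL
  })
}

results <- list()
for (i in 1:5) {
  # Stand-in for htmlParse(link): fails on page 3 to imitate a page
  # that cannot be loaded.
  doc <- try_or_null(function(page) {
    if (page == 3) stop("failed to load HTTP resource")
    paste("parsed page", page)
  }, i)
  if (is.null(doc)) next  # error was caught: go on to the next page
  results[[as.character(i)]] <- doc
}
names(results)
# [1] "1" "2" "4" "5"
```

In the original loop the call would presumably look like `doc <- try_or_null(htmlParse, link, encoding = "UTF-8")`, followed by the same `is.null(doc)` check.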
What is doc?

Sorry, it should be 'htmlParse(response, ...)'. The answer has been edited. – jlhoward
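
If passing the response object straight to htmlParse does not behave as expected with your package versions, one alternative is to extract the body as text first and parse that with `asText = TRUE`. The following is a minimal sketch under that assumption; the HTML string is made-up stand-in data so the example runs offline.

```r
library(XML)

# Stand-in for the page body (made-up data). With a real httr response this
# would be something like:
#   html_text <- httr::content(response, as = "text", encoding = "UTF-8")
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Variaveis</th><th>status</th></tr>",
  "<tr><td>a</td><td>b</td></tr>",
  "</table></body></html>"
)

# asText = TRUE tells htmlParse that the argument is HTML content,
# not a file name or URL.
doc <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
```

From here `tables` can be processed in the same way as in the question's loop.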