
I am downloading weather data from the web. To do this I created a simple for loop that appends data frames with the data to lists (one list per city). It works fine, but if there is no data (no weather-conditions table on the page for a particular date) it returns an error, for example for this URL ("https://www.wunderground.com/history/airport/EPLB/2015/12/25/DailyHistory.html?req_city=Abramowice%20Koscielne&req_statename=Poland"). The error in R:

Error in Lublin[i] <- url4 %>% read_html() %>% html_nodes(xpath = "//*[@id=\"obsTable\"]") %>% : 
    replacement has length zero 

The error occurs because html_nodes() finds no matching table on such pages, so html_table() returns an empty list and the replacement has length zero. How can I return a row with NAs (13 observations) instead when this error occurs, and put it into the list?

Also, is there a faster way to download all the data than a for loop?

My code:

library(rvest) 

c <- seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days") 
Warszawa <- list() 
Wroclaw <- list() 
Bydgoszcz <- list() 
Lublin <- list() 
Gorzow <- list() 
Lodz <- list() 
Krakow <- list() 
Opole <- list() 
Rzeszow <- list() 
Bialystok <- list() 
Gdansk <- list() 
Katowice <- list() 
Kielce <- list() 
Olsztyn <- list() 
Poznan <- list() 
Szczecin <- list() 
date <- list() 
for(i in 1:length(c)) { 
y<-as.numeric(format(c[i],'%Y')) 
m<-as.numeric(format(c[i],'%m')) 
d<-as.numeric(format(c[i],'%d')) 
date[i] <- c[i] 
url1 <- sprintf("https://www.wunderground.com/history/airport/EPWA/%d/%d/%d/DailyHistory.html?req_city=Warszawa&req_state=MZ&req_statename=Poland", y, m, d) 
url2 <- sprintf("https://www.wunderground.com/history/airport/EPWR/%d/%d/%d/DailyHistory.html?req_city=Wrocław&req_statename=Poland", y, m, d) 
url3 <- sprintf("https://www.wunderground.com/history/airport/EPBY/%d/%d/%d/DailyHistory.html?req_city=Bydgoszcz&req_statename=Poland", y, m, d) 
url4 <- sprintf("https://www.wunderground.com/history/airport/EPLB/%d/%d/%d/DailyHistory.html?req_city=Abramowice%%20Koscielne&req_statename=Poland", y, m, d) 
url5 <- sprintf("https://www.wunderground.com/history/airport/EPZG/%d/%d/%d/DailyHistory.html?req_city=Gorzow%%20Wielkopolski&req_statename=Poland", y, m, d) 
url6 <- sprintf("https://www.wunderground.com/history/airport/EPLL/%d/%d/%d/DailyHistory.html?req_city=Lodz&req_statename=Poland", y, m, d) 
url7 <- sprintf("https://www.wunderground.com/history/airport/EPKK/%d/%d/%d/DailyHistory.html?req_city=Krakow&req_statename=Poland", y, m, d) 
url8 <- sprintf("https://www.wunderground.com/history/airport/EPWR/%d/%d/%d/DailyHistory.html?req_city=Opole&req_statename=Poland", y, m, d) 
url9 <- sprintf("https://www.wunderground.com/history/airport/EPRZ/%d/%d/%d/DailyHistory.html?req_city=Rzeszow&req_statename=Poland", y, m, d) 
url10 <- sprintf("https://www.wunderground.com/history/airport/UMMG/%d/%d/%d/DailyHistory.html?req_city=Dojlidy&req_statename=Poland", y, m, d) 
url11 <- sprintf("https://www.wunderground.com/history/airport/EPGD/%d/%d/%d/DailyHistory.html?req_city=Gdansk&req_statename=Poland", y, m, d) 
url12 <- sprintf("https://www.wunderground.com/history/airport/EPKM/%d/%d/%d/DailyHistory.html?req_city=Katowice&req_statename=Poland", y, m, d) 
url13 <- sprintf("https://www.wunderground.com/history/airport/EPKT/%d/%d/%d/DailyHistory.html?req_city=Chorzow%%20Batory&req_statename=Poland", y, m, d) 
url14 <- sprintf("https://www.wunderground.com/history/airport/EPSY/%d/%d/%d/DailyHistory.html", y, m, d) 
url15 <- sprintf("https://www.wunderground.com/history/airport/EPPO/%d/%d/%d/DailyHistory.html?req_city=Poznan%%20Old%%20Town&req_statename=Poland", y, m, d) 
url16 <- sprintf("https://www.wunderground.com/history/airport/EPSC/%d/%d/%d/DailyHistory.html?req_city=Szczecin&req_statename=Poland", y, m, d) 

Warszawa[i] <- url1 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Wroclaw[i] <- url2 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Bydgoszcz[i] <- url3 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Lublin[i] <- url4 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Gorzow[i] <- url5 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Lodz[i] <- url6 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Krakow[i] <- url7 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Opole[i] <- url8 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Rzeszow[i] <- url9 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Bialystok[i] <- url10 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Gdansk[i] <- url11 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Katowice[i] <- url12 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Kielce[i] <- url13 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Olsztyn[i] <- url14 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Poznan[i] <- url15 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
Szczecin[i] <- url16 %>% 
    read_html() %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 

} 

Thanks for your help.


You could use 'tryCatch' for that. –
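For instance, a minimal sketch of that tryCatch idea (the helper name one_city and the 13-column NA fallback are illustrative; the fallback would need the real table's column names):

one_city <- function(url) { 
    tryCatch( 
     url %>% 
      read_html() %>% 
      html_nodes(xpath = '//*[@id="obsTable"]') %>% 
      html_table() %>% 
      .[[1]], 
     error = function(e) { 
      # No table on this page: return one row of 13 NAs instead of failing 
      as.data.frame(matrix(NA, nrow = 1, ncol = 13)) 
     } 
    ) 
} 

# Used inside the loop, e.g.: Lublin[[i]] <- one_city(url4) 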


Tip: don't use 'c' as a variable name, because it is the function used to create vectors in R. –


You also have quite a lot of duplicated code. I think you could create one function that does the same thing and just swap in the variables you need. As for the error, I would follow @docendo discimus's advice. –

Answers


First off, apologies that the answer turned out a bit longer than initially planned. I decided to help you with three problems: the duplication in building the valid URLs, the duplication in fetching the relevant information from those URLs, and the errors that occur while scraping.

So here we go. First, you will want to build the links you want to scrape a bit more simply:

library(httr) 
library(rvest) 

## All the dates: 
dates <- seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days") 
dates <- gsub("-", "/", x = dates) 

## All the regions and links: 
abbreviations <- c("EPWA", "EPWR", "EPBY", "EPLB", "EPZG", "EPLL", "EPKK",   
         "EPWR", "EPRZ", "UMMG", "EPGD", "EPKM", "EPKT", 
         "EPSY", "EPPO", "EPSC") 

links <- paste0("https://www.wunderground.com/history/airport/", 
       abbreviations, "/") 
links <- lapply(links, function(x){paste0(x, dates, "/DailyHistory.html")}) 
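As a quick sanity check, links should now be a list of 16 character vectors (one per airport code), each holding 731 daily URLs:

length(links)  # 16, one sublist per airport code 
length(links[[1]]) # 731 days (2015-01-01 through 2016-12-31) 
links[[1]][1] 
# [1] "https://www.wunderground.com/history/airport/EPWA/2015/01/01/DailyHistory.html" 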

Now that we have all the links in links, we define a function that will check a link, scrape the HTML, and get whatever information we want from it. In your case that is: the city name, the date, and the weather table. I decided to use the city name and date as the names of the objects, so you can easily see which weather table belongs to which city and date:

## Get the weather report & name 
get_table <- function(link){ 
    # Get the html from a link 
    html <- try(link %>% 
      read_html()) 
    if("try-error)" %in% class(html)){ 
     print("HTML not found, skipping to next link") 
     return("HTML not found, skipping to next link") 
    } 

    # Get the weather table from that page 
    weather_table <- html %>% 
    html_nodes(xpath='//*[@id="obsTable"]') %>% 
    html_table() 
    if(length(weather_table) == 0){ 
    print("No weather table available for this day") 
    return("No weather table available for this day") 
    } 

    # Use info from the html to get the city, for naming the list 
    region <- html %>% 
    html_nodes(xpath = '//*[@id="location"]') %>% 
    html_text() 
    region <- strsplit(region, "[1-9]")[[1]][1] 
    region <- gsub("\n", "", region) 
    region <- gsub("\t\t", "", region) 

    # Use info from the html to get the date, and name the list 
    which_date <- html %>% 
    html_nodes(xpath = '//*[@class="history-date"]') %>% 
    html_text() 

    city_date <- paste0(region, which_date) 

    # Name the output 
    names(weather_table) <- city_date 

    print(paste0("Just scraped ", city_date)) 
    return(weather_table) 
} 

Running this function should work for all the URLs we defined, including the faulty URL from your question:

# A little test run, to see if your faulty URL works: 
testlink <- "https://www.wunderground.com/history/airport/EPLB/2015/12/25/DailyHistory.html?req_city=Abramowice%20Koscielne&req_statename=Poland" 
links[[1]][5] <- testlink 
tested <- sapply(links[[1]][1:6], get_table, USE.NAMES = FALSE) 
    # [1] "Just scraped Warsaw, Poland Thursday, January 1, 2015" 
    # [1] "Just scraped Warsaw, Poland Friday, January 2, 2015" 
    # [1] "Just scraped Warsaw, Poland Saturday, January 3, 2015" 
    # [1] "Just scraped Warsaw, Poland Sunday, January 4, 2015" 
    # [1] "No weather table available for this day" 
    # [1] "Just scraped Warsaw, Poland Tuesday, January 6, 2015" 

Works like a charm, so you can use the following loop to get the weather data for Poland:

# For all sublists in links (corresponding to cities) 
# scrape all links (corresponding to days) 
city <- rep(list(list()), length(abbreviations)) 
for(i in 1:length(links)){ 
    city[[i]] <- sapply(links[[i]], get_table, USE.NAMES = FALSE) 
} 
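As a small optional addition (not part of the original answer), naming the outer list by the airport codes makes it easier to tell the sublists apart afterwards:

# Optional: label each city's sublist by its airport code 
names(city) <- abbreviations 
str(city, max.level = 1) 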

Thanks! This is great :) I managed to get rid of the no-data errors and simplify my code, but I had problems figuring out which output belongs to which city... This is simply brilliant :) – piotr


But there is a small problem when the loop moves from one element to the next: [1] "Just scraped Warsaw, Poland Wednesday, December 28, 2016" [1] "Just scraped Warsaw, Poland Thursday, December 29, 2016" [1] "Just scraped Warsaw, Poland Friday, December 30, 2016" [1] "Just scraped Warsaw, Poland Saturday, December 31, 2016" Error in city[[i]] <- sapply(links[[i]], get_table, USE.NAMES = FALSE) : object 'city' not found – piotr


Yes, you're right. I fixed it by creating an empty list with sublists. –
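(That fix is the pre-allocation line already included in the final loop above:)

city <- rep(list(list()), length(abbreviations)) 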


Since all of these URLs are basically the same thing with slight, predictable differences, why not loop over arrays, concatenating everything together, and run that?

Here is an example of what I am referring to.

library(rvest) 
library(stringr) 

#create a master dataframe to store all of the results 
complete <- data.frame() 

yearsVector <- c("2010", "2011", "2012", "2013", "2014", "2015") 
#position is not needed since all of the info is stored on the page 
#positionVector <- c("qb", "rb", "wr", "te", "ol", "dl", "lb", "cb", "s") 
positionVector <- c("qb") 
for (i in 1:length(yearsVector)) { 
    for (j in 1:length(positionVector)) { 
     # create a url template 
     URL.base <- "http://www.nfl.com/draft/" 
     URL.intermediate <- "/tracker?icampaign=draft-sub_nav_bar-drafteventpage-tracker#dt-tabs:dt-by-position/dt-by-position-input:" 
     #create the dataframe with the dynamic values 
     URL <- paste0(URL.base, yearsVector[i], URL.intermediate, positionVector[j]) 
     #print(URL) 

     #read the page - store the page to make debugging easier 
     page <- read_html(URL) 

     # ...parse 'page' here and append the resulting rows to 'complete' 
    } 
} 