2016-11-15 92 views
0

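Summarizing selected entries in a dataset with a query

I'm still very new to R and am trying to summarize data in a specific way. To illustrate it here, I'm using the weather data from the nasaweather package. For example, I want to get the mean temperature on a specific day of the month, and show it for the 3 origins and 12 months included in this dataset.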

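I figured I could do it with the code below: I specify the day I'm interested in, create an empty data frame to fill, then run a for loop over the months that computes the mean temperature for each origin, binds those means to the month number, and then binds that row to the data frame. Finally, I adjust the column names and print the result: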

library(nasaweather) 
library(magrittr) 
library(dplyr) 

query_day = 15 
data_output <- data.frame(month = numeric(), 
       EWR = numeric(), 
       JFK = numeric(), 
       LGA = numeric()) 

for (i in 1:12) { 
    data_subset <- weather %>% 
    filter(day == query_day, month == i) %>% 
    summarize(
     EWR = mean(temp[origin == "EWR"]), 
     JFK = mean(temp[origin == "JFK"]), 
     LGA = mean(temp[origin == "LGA"])) 
    data_output <- rbind(data_output, cbind(i, data_subset)) 
    rm(data_subset) 
} 

names(data_output) <- c("month", "EWR", "JFK", "LGA") 
print(data_output) 

For me, this produces the following:

month  EWR  JFK  LGA 
1  1 39.3725 39.0875 38.9150 
2  2 42.1625 39.3425 42.9050 
3  3 37.4150 36.7775 37.3025 
4  4 50.1275 48.1550 49.2050 
5  5 58.8725 55.7150 59.1575 
6  6 70.7825 70.2950 71.5700 
7  7 86.9900 85.1225 87.2000 
8  8 69.2075 69.0725 69.9425 
9  9 60.6350 61.2125 61.7375 
10 10 59.8850 58.3850 60.5150 
11 11 45.7475 45.1700 49.0700 
12 12 32.4950 38.0975 34.0325 

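This is exactly what I want. It just seems to me that my code is too complicated, so I wanted to ask whether there is a simpler way to get this done?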

+1

You can just use the aggregate function and then reshape: 'a <- aggregate(temp ~ month + origin, weather, mean); reshape(a, id = 'month', ...)'

+0

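Thanks @Dirk, but if I understand it correctly, this would give the mean temperature over the whole month rather than on a specific day. Is there a way to specify that within the aggregate function?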

+0

Ah, missed that: 'a <- aggregate(temp ~ month + origin, weather[weather$day == query_day, ], mean); reshape(a, id = 'month', ...)'
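For anyone who wants to try the aggregate-then-reshape idea from the comments without the real data, here is a minimal base R sketch. The `toy_weather` data frame and its values are made up for illustration, and the `idvar`, `timevar` and `direction` arguments are my guess at what the `...` in the comment stands for, not something the commenter spelled out.

```r
# Hypothetical toy data standing in for `weather` (values are made up).
# The last row is from a different day and must be filtered out.
toy_weather <- data.frame(
  origin = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK", "JFK", "JFK", "EWR"),
  month  = c(1, 1, 2, 2, 1, 1, 2, 2, 1),
  day    = c(15, 15, 15, 15, 15, 15, 15, 15, 16),
  temp   = c(30, 32, 40, 42, 31, 33, 41, 45, 99)
)
query_day <- 15

# Keep only the day of interest, then average temp per month and origin
a <- aggregate(temp ~ month + origin,
               data = toy_weather[toy_weather$day == query_day, ],
               FUN = mean)

# Long to wide: one row per month, one temp.<origin> column per origin
wide <- reshape(a, idvar = "month", timevar = "origin", direction = "wide")
print(wide)
```

The wide result has the columns `month`, `temp.EWR` and `temp.JFK`, so you would still rename them if you want the bare origin codes as headers.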

Answers

1

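There are various issues with your code... but the main one is that you don't group_by first. Once you include that, this becomes easy peasy. See my comments on your code, and then the finalized code at the bottom: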

library(nasaweather) ## Wrong package 
# library(magrittr) ## not needed, dplyr re-exports the %>% pipe 
library(dplyr) 

query_day = 15 
# data_output <- data.frame(month = numeric(), ## We won't need to specify this explicitly 
## (but you are right that you should initialize this before a for loop. Go one step 
## further by actually telling the data.frame how many rows to expect. 
## But not in this case, because we won't use a for loop) 
         # EWR = numeric(), 
         # JFK = numeric(), 
         # LGA = numeric()) 

for (i in 1:12) { ## You don't need a for loop... group_by() followed by summarize() does this. 
    data_subset <- weather %>% 
    filter(day == query_day, month == i) %>% 
    summarize(  ## Before doing summarize, you need a group_by to say what to summarize by 
     EWR = mean(temp[origin == "EWR"]), 
     JFK = mean(temp[origin == "JFK"]), 
     LGA = mean(temp[origin == "LGA"])) 
    data_output <- rbind(data_output, cbind(i, data_subset)) ## If you're doing the group_by, this step isn't required. 
    # rm(data_subset) ## You don't have to remove temporary datasets here... 
## Each iteration simply overwrites it (note that it does persist after the loop ends). 
} 

names(data_output) <- c("month", "EWR", "JFK", "LGA") 
print(data_output) 

################### Better code: 
library(nycflights13) ## the package you want is nycflights13... not nasaweather 
library(dplyr) 

query_day = 15 

weather %>% 
    filter(day == query_day) %>% 
    group_by(month) %>% 
    summarize(
     EWR = mean(temp[origin == "EWR"]), 
     JFK = mean(temp[origin == "JFK"]), 
     LGA = mean(temp[origin == "LGA"])) -> data_output 

data_output 

Yields:

> data_output 
# A tibble: 12 × 4 
    month  EWR  JFK  LGA 
    <dbl> <dbl> <dbl> <dbl> 
1  1 39.3725 39.0875 38.9150 
2  2 42.1625 39.3425 42.9050 
3  3 37.4150 36.7775 37.3025 
4  4 50.1275 48.1550 49.2050 
5  5 58.8725 55.7150 59.1575 
6  6 70.7825 70.2950 71.5700 
7  7 86.9900 85.1225 87.2000 
8  8 69.2075 69.0725 69.9425 
9  9 60.6350 61.2125 61.7375 
10 10 59.8850 58.3850 60.5150 
11 11 45.7475 45.1700 49.0700 
12 12 32.4950 38.0975 34.0325 
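To see what the group_by step actually changes, here is a small sketch on made-up data. `toy_weather` and its values are hypothetical and only stand in for the real `weather` table; the point is just the difference in the number of rows you get back.

```r
library(dplyr)

# Hypothetical toy data standing in for `weather` (values are made up)
toy_weather <- tibble(
  origin = rep(c("EWR", "JFK", "LGA"), times = 4),
  month  = rep(c(1, 2), each = 6),
  day    = 15,
  temp   = c(30, 31, 32, 34, 35, 36, 40, 41, 42, 44, 45, 46)
)

# Without group_by: everything collapses into a single row
ungrouped <- toy_weather %>%
  summarize(EWR = mean(temp[origin == "EWR"]))

# With group_by: one row per month
grouped <- toy_weather %>%
  group_by(month) %>%
  summarize(EWR = mean(temp[origin == "EWR"]))

nrow(ungrouped)  # 1
nrow(grouped)    # 2
```

So the for loop over months in the question is doing by hand what group_by(month) does for you.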
+0

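Thanks @Amit for all these helpful comments, much appreciated! I had tried 'group_by' first but never got it to work, so I guess I must have done something very wrong. However, when I run the 11 lines of the improved version (## Better code:), I don't get back the 12 × 4 tibble but only a single row: '1 54.47438 53.86937 55.12938', with one value each for 'EWR JFK LGA', which I guess is the average over all 12 months. Any idea what I'm doing wrong (again) here?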

+0

Sounds strange... clear the console, or even restart RStudio, and try again? I just re-ran it and it works fine.

+0

Restarting RStudio did the job, now it works for me too, thanks!
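The thread never pins down why a restart helped. One well-known way to get exactly this one-row symptom is another attached package masking summarize() (plyr does this when it is loaded after dplyr), but that is an assumption on my part, not something confirmed above. If you suspect it, a minimal check and a workaround could look like this, with `toy` being made-up data:

```r
library(dplyr)

# Which package does the bare name `summarize` resolve to right now?
# In a clean session with only dplyr attached this should be "dplyr".
environmentName(environment(summarize))

# Hypothetical toy data (values are made up)
toy <- data.frame(month = c(1, 1, 2, 2), temp = c(30, 32, 40, 42))

# Qualifying the calls with dply:: sidesteps any masking
by_month <- toy %>%
  dplyr::group_by(month) %>%
  dplyr::summarize(mean_temp = mean(temp))

by_month
```

Restarting the session clears whatever extra packages were attached, which would be consistent with the fix reported above.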