2015-10-05
0

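Making a data frame of counts

I am working with count data, i.e. each date + time combination represents one data point. My current data frame looks like this: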

DATE  TIME 
1 2014-02-15 15:02 
2 2014-02-15 15:12 
3 2014-04-15 02:02 
4 2014-05-15 11:02 
5 2014-06-15 15:42 
6 2014-06-15 16:02 
.... 

Now I would like a new data frame that counts how many data points there are per hour on each date. Something like the following:

DATE  HOUR COUNT 
1 2014-02-15 15  2 
2 2014-04-15 02  1 
3 2014-05-15 11  1 
4 2014-06-15 15  1 
5 2014-06-15 16  1 
.... 

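I want this so that I can make a boxplot with x = hour of the day and y = number of data points (over a year). I tried to do it with nested for loops, but that did not work.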


Edit: if possible, date/hour combinations with no data points should also appear in the data frame, with COUNT = 0.

+1

The second part you just edited in is a bit more complicated. – TARehman

Answers

1

Is this what you are looking for?

options(stringsAsFactors = FALSE)

data = read.table(text = 
"     1 2014-02-15 15:02 
        2 2014-02-15 15:12 
        3 2014-04-15 02:02 
        4 2014-05-15 11:02 
        5 2014-06-15 15:42 
        6 2014-06-15 16:02") 


colnames(data) = c("index", "date", "time") 

table(data$date) 

# 2014-02-15 2014-04-15 2014-05-15 2014-06-15
#          2          1          1          2

table(data$date, data$time) 

fz = table(data$date, substr(data$time, 1,2)) 
print(fz) 

#              02 11 15 16
#   2014-02-15  0  0  2  0
#   2014-04-15  1  0  0  0
#   2014-05-15  0  1  0  0
#   2014-06-15  0  0  1  1

If you want to reshape your data, you can do the following:

library(reshape) 

otherFormat = melt(fz) 
colnames(otherFormat) = c("date","hour", "frequency") 

print(otherFormat) 

#          date hour frequency
# 1  2014-02-15    2         0
# 2  2014-04-15    2         1
# 3  2014-05-15    2         0
# 4  2014-06-15    2         0
# 5  2014-02-15   11         0
# 6  2014-04-15   11         0
# 7  2014-05-15   11         1
# 8  2014-06-15   11         0
# 9  2014-02-15   15         2
# 10 2014-04-15   15         0
# 11 2014-05-15   15         0
# 12 2014-06-15   15         1
# 13 2014-02-15   16         0
# 14 2014-04-15   16         0
# 15 2014-05-15   16         0
# 16 2014-06-15   16         1
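Since the question is ultimately after a boxplot of counts per hour, here is a minimal base R sketch of that last step. It rebuilds the same long-format counts with as.data.frame(table(...)) instead of melt(), so it needs no extra package. The object name counts and the axis labels are assumptions made for illustration, and only dates and hours that occur in the data are included.

```r
# Example data from the question
data <- data.frame(
  date = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  time = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# Long-format counts per date and hour (base R counterpart of melt(fz));
# 'counts' is a name chosen for this sketch
counts <- as.data.frame(
  table(date = data$date, hour = substr(data$time, 1, 2)),
  responseName = "frequency"
)

# One box per hour; each box summarises that hour's counts across the dates
boxplot(frequency ~ hour, data = counts,
        xlab = "Hour of day", ylab = "Data points per date")
```

With a full year of data, each box would then summarise up to 365 daily counts for that hour.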
+2

FYI, you can reshape by wrapping 'table' in 'as.data.frame', like 'as.data.frame(table(data$date, substr(data$time, 1, 2)))'. By the way, I think you forgot to define this 'fz' object. – Frank

+0

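I think this is exactly what I meant! Thanks! One more question: my data only shows the dates/times that have data points. If I also want to show that a given hour has no data points, how do I add those rows? I would like the data frame to run between two dates and show all 24 hours as separate rows, and if the count for a date/hour combination is 0, it must say freq = 0 rather than be omitted. – Guido167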

+0

Thanks, Frank! I have always used melt directly. Great tip! And yes, I forgot to define fz. – 2015-10-06 07:16:19

0

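There are several ways you can do this, but I suspect the easiest is to use table(). With table(), you can return the frequencies of the dates. This is basically just a count of each date in the data frame.

You can do the same thing after extracting the hour, and you can even cross-tabulate the two by calling table(DF$DATE, DF$HOUR). Wrapping the result in as.data.frame() will give you a listing that looks somewhat like what you are after.

Edited to add: in response to your edit to the question, you can use factor levels to get the zero-count levels in the table() call. table() respects your factor levels by including them in the output even when they do not occur in the input (in fact, I think table() coerces its inputs to factors behind the scenes).

Example code: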

# Set options and load example data 
options(stringsAsFactors = FALSE) 
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"), 
         TIME = c("15:02","15:12","02:02","11:02","15:42","16:02")) 

# Extract the hour 
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1) 

# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting 
date.data$HOUR <- factor(x = date.data$HOUR,
                         levels = sprintf("%02d", 0:23))  # "00", "01", ..., "23"

# Obtain the first table of interest 
as.data.frame(table(date.data$DATE)) 

#         Var1 Freq
# 1 2014-02-15    2
# 2 2014-04-15    1
# 3 2014-05-15    1
# 4 2014-06-15    2

# And the second table 
as.data.frame(table(date.data$DATE,date.data$HOUR)) 

#         Var1 Var2 Freq
# 1 2014-02-15   00    0
# 2 2014-04-15   00    0
# 3 2014-05-15   00    0
# 4 2014-06-15   00    0
# 5 2014-02-15   01    0
# 6 2014-04-15   01    0
# 7 2014-05-15   01    0
# 8 2014-06-15   01    0
.... 
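The same factor-level idea can be extended to the dates, which is what the follow-up comment on the first answer asks for (every day between the first and last date, with all 24 hours). What follows is only a sketch in base R, assuming DATE holds ISO-formatted strings as in the example data. The names all.days and counts are introduced here for illustration.

```r
# Example data from the question (DATE and TIME as character vectors)
date.data <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# Every calendar day from the first to the last observed date, as strings
all.days <- format(seq(min(as.Date(date.data$DATE)),
                       max(as.Date(date.data$DATE)),
                       by = "day"))

# Declare the full set of dates and hours as factor levels
date.data$DATE <- factor(date.data$DATE, levels = all.days)
date.data$HOUR <- factor(substr(date.data$TIME, 1, 2),
                         levels = sprintf("%02d", 0:23))

# table() keeps unused levels, so empty date/hour cells appear with Freq = 0
counts <- as.data.frame(table(DATE = date.data$DATE, HOUR = date.data$HOUR),
                        stringsAsFactors = FALSE)

nrow(counts)      # number of days in the range times 24
sum(counts$Freq)  # equals the number of original data points
```

The result has one row per day and hour in the range, so it grows quickly for long periods; for the boxplot in the question that is the intended shape.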
+0

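Thanks for your comment, but this is not exactly what I meant (although it is very close). I need the frequency count per date and hour, so if I have two data points on 2014-02-15 in hour 15.xx, the frequency should be 2. With your code I get two separate frequency tables (one for the date frequency and one for the hour frequency). – Guido167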

+0

But... doesn't the second one give you what you need? It has date and hour together. – TARehman

1

IMO, the most readable way:

Edited to answer your updated question

library(dplyr)
library(stringr)

# count data points per date and hour
df <- date.data %>%
  group_by(
    DATE = as.Date(DATE),
    HOUR = as.numeric(str_sub(TIME, 1, 2))
  ) %>%
  tally() %>%
  ungroup()

# create a data frame with all dates/hours
expand.grid(
  # include all dates from first to last
  DATE = seq.Date(min(df$DATE), max(df$DATE), "day"),
  HOUR = 0:23
) %>%
  arrange(DATE) %>%
  left_join(df, by = c("DATE", "HOUR")) %>%
  # date/hour combinations without data points come back as NA; set them to 0
  mutate(n = replace(n, is.na(n), 0L))
+1

A couple more lines with 'tidyr::expand' and 'left_join' in the mix would also address the OP's edit request. – hrbrmstr

1

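Another option is the following. First, you create a column for the hour in mutate(). Then you count how many data points exist for each DATE and hour in count(). Once the data is ungrouped, you can join the two data frames to create the result you want. The expand.grid() part creates all combinations of DATE and hour (00 to 23). Since your hours are zero-padded (e.g. 02), I used c(paste0("0", 0:9), 10:23). Finally, NA is replaced with 0 in the last mutate().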

library(dplyr) 
library(stringi) 
library(data.table) 

mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>% 
count(DATE, hour) %>% 
ungroup %>% 
right_join(expand.grid(DATE = unique(.$DATE), 
         hour = c(paste0("0", 0:9), 10:23))) %>% 
mutate(n = replace(n, is.na(n), 0)) 

# A bit of outcome 
#   DATE hour n 
#1 2014-02-15 00 0 
#2 2014-04-15 00 0 
#3 2014-05-15 00 0 
#4 2014-06-15 00 0 
#5 2014-02-15 01 0 

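With data.table, you can do the same operation. You create a column hour and count the number of data points by DATE and hour. Then you merge temp with a data table that holds all combinations of DATE and hour (00 to 23). You can create that data table with CJ(). Once the merge is done, you replace NA with 0 in the count column (total).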

setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][, 
      list(total = .N), by = list(DATE, hour)] -> temp 

merge(temp, 
     CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)), 
     by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][] 

#   DATE hour total 
# 1: 2014-02-15 02  0 
# 2: 2014-02-15 11  0 
# 3: 2014-02-15 15  2 
# 4: 2014-02-15 16  0 
# 5: 2014-02-15 00  0 

DATA

mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205, 
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L, 
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42", 
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE", 
"TIME"), row.names = c(NA, -6L))