Make a count data frame

I have data that I am working with; it is count data, i.e. each date + time combination represents one data point. My current data frame looks like this:

DATE  TIME 
1 2014-02-15 15:02 
2 2014-02-15 15:12 
3 2014-04-15 02:02 
4 2014-05-15 11:02 
5 2014-06-15 15:42 
6 2014-06-15 16:02 
....

Now I would like a new data frame that counts how many data points there are per hour on a given date. Something like this:

DATE  HOUR COUNT 
1 2014-02-15 15  2 
2 2014-04-15 02  1 
3 2014-05-15 11  1 
4 2014-06-15 15  1 
5 2014-06-15 16  1 
....

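I want this so that I can make a box plot with X = hour of the day and Y = number of data points (over a year). I tried to do it with nested for loops, but it did not work.

EDIT: If possible, date/time combinations with no data points should also be in the data frame, but with COUNT = 0.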

Source

2015-10-05 Guido167

The second part you just edited in is a bit more complicated. – TARehman

Is this what you are looking for?

options(stringsAsFactors = FALSE) 

data = read.table(text = 
"     1 2014-02-15 15:02 
        2 2014-02-15 15:12 
        3 2014-04-15 02:02 
        4 2014-05-15 11:02 
        5 2014-06-15 15:42 
        6 2014-06-15 16:02") 


colnames(data) = c("index", "date", "time") 

table(data$date) 

# 2014-02-15 2014-04-15 2014-05-15 2014-06-15 
#  2   1   1   2 

table(data$date, data$time) 

fz = table(data$date, substr(data$time, 1,2)) 
print(fz) 

#   02 11 15 16 
# 2014-02-15 0 0 2 0 
# 2014-04-15 1 0 0 0 
# 2014-05-15 0 1 0 0 
# 2014-06-15 0 0 1 1

If you want to reshape your data, you can do the following:

library(reshape) 

otherFormat = melt(fz) 
colnames(otherFormat) = c("date","hour", "frequency") 

print(otherFormat) 

#   date hour frequency 
# 1 2014-02-15 2   0 
# 2 2014-04-15 2   1 
# 3 2014-05-15 2   0 
# 4 2014-06-15 2   0 
# 5 2014-02-15 11   0 
# 6 2014-04-15 11   0 
# 7 2014-05-15 11   1 
# 8 2014-06-15 11   0 
# 9 2014-02-15 15   2 
# 10 2014-04-15 15   0 
# 11 2014-05-15 15   0 
# 12 2014-06-15 15   1 
# 13 2014-02-15 16   0 
# 14 2014-04-15 16   0 
# 15 2014-05-15 16   0 
# 16 2014-06-15 16   1

Source

2015-10-05 15:14:42

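FYI, you can reshape by wrapping `table` in `as.data.frame`, like `as.data.frame(table(data$date, substr(data$time, 1, 2)))`, I think. By the way, you forgot to define this `fz` object. – Frank

I think this is exactly what I meant! Thanks! One more question: my data only shows the dates/times that have data points. If I want to show that there were no data points at a certain time, how do I add those rows? I would like the data frame to run between two dates and show all 24 hours as separate rows, and if the count for a date/hour combination is 0, it must say freq = 0 rather than being omitted. – Guido167

Thanks Frank! I always use melt directly. Great tip! And yes, I forgot to define fz. – 2015-10-06 07:16:19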
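Frank's suggestion can be sketched roughly as follows. This is a minimal base R example, assuming the six rows from the question; the names `obs`, `counts`, and `nonzero` are made up for illustration, and `responseName` is only used to give the count column a readable name.

```r
# Example rows from the question (column names are assumptions)
obs <- data.frame(
  date = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  time = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# Cross-tabulate date against hour, then flatten the table to long format.
# Naming the arguments of table() names the resulting columns.
counts <- as.data.frame(
  table(date = obs$date, hour = substr(obs$time, 1, 2)),
  responseName = "frequency",
  stringsAsFactors = FALSE
)

# Keep only the combinations that actually occur, as in the question's
# first desired output
nonzero <- counts[counts$frequency > 0, ]
nonzero <- nonzero[order(nonzero$date, nonzero$hour), ]
print(nonzero)
```

Dropping the zero rows is optional; `counts` itself already holds every date/hour combination that appears in at least one row of the input.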

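There are several ways to do this, but I suspect the easiest is to use table. Using table, you can get back the frequency of each date. This is basically just a count of the dates in the data frame.

You can do the same thing after extracting the hour, and even nest it by doing table(DF$DATE, DF$HOUR). Wrapping that in as.data.frame gives you a listing close to what you are looking for.

Edited to add: in response to your edit to the question, you can use factor levels to get the zero-count levels out of the table call. table respects your factor levels by including them in the output even if they do not occur in the input (in fact, I think table coerces its inputs to factors behind the scenes).

Example code: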

# Set options and load example data 
options(stringsAsFactors = FALSE) 
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"), 
         TIME = c("15:02","15:12","02:02","11:02","15:42","16:02")) 

# Extract the hour 
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1) 

# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting 
date.data$HOUR <- factor(x = date.data$HOUR, 
         levels = c("00","01","02","03","04","05", 
            "06","07","08","09","10","11", 
            "12","13","14","15","16","17", 
            "18","19","20","21","22","23"), 
         labels = c("00","01","02","03","04","05", 
            "06","07","08","09","10","11", 
            "12","13","14","15","16","17", 
            "18","19","20","21","22","23")) 

# Obtain the first table of interest 
as.data.frame(table(date.data$DATE)) 

     Var1 Freq 
1 2014-02-15 2 
2 2014-04-15 1 
3 2014-05-15 1 
4 2014-06-15 2 

# And the second table 
as.data.frame(table(date.data$DATE,date.data$HOUR)) 

     Var1 Var2 Freq 
1 2014-02-15 00 0 
2 2014-04-15 00 0 
3 2014-05-15 00 0 
4 2014-06-15 00 0 
5 2014-02-15 01 0 
6 2014-04-15 01 0 
7 2014-05-15 01 0 
8 2014-06-15 01 0 
....
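For the box plot mentioned in the question, the zero-filled counts can be passed straight to `boxplot()`. The following is only a sketch under assumptions: it rebuilds the six example rows, uses `sprintf("%02d", 0:23)` as a shorter way to write the 24 hour levels, and the names `hour.levels` and `counts` are illustrative.

```r
# Example rows from the question
date.data <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)

# All 24 hours as zero-padded strings: "00", "01", ..., "23"
hour.levels <- sprintf("%02d", 0:23)

# One row per date/hour combination, with zeros for hours without data
counts <- as.data.frame(
  table(DATE = date.data$DATE,
        HOUR = factor(substr(date.data$TIME, 1, 2), levels = hour.levels)),
  responseName = "COUNT"
)

# One box per hour of the day; each box summarises the counts across dates
boxplot(COUNT ~ HOUR, data = counts,
        xlab = "Hour of day", ylab = "Data points per date")
```

With only four dates the boxes are mostly flat at zero; the plot becomes more informative once a full year of dates is tabulated.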

Source

2015-10-05 14:58:54 TARehman

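Thanks for your comment, but this is not quite what I meant (although it is very close). I need the frequency count per date and hour, so if I have two data points on 2014-02-15 in hour 15.xx, the frequency should be 2. With your code I get two different frequency tables (one for the date frequency and one for the hour frequency). – Guido167

But... doesn't the second one give you what you need? It has the date and the hour together. – TARehman

IMO, the most readable way:

Edited to answer your updated question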

library(dplyr) 
library(stringr) 

df <- date.data %>% 
    group_by(
    DATE = as.Date(DATE), 
    HOUR = as.numeric(str_sub(TIME, 1, 2)) 
    ) %>% 
    tally 

# create a data frame with all dates/hours 
expand.grid(
    # include all dates from first to last 
    DATE = seq.Date(min(df$DATE), max(df$DATE), "day"), 
    HOUR = 0:23 
) %>% 
    arrange(DATE) %>% 
    left_join(df, by = c("DATE", "HOUR"))
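Note that the join above leaves NA in the count column `n` for date/hour combinations without any rows. A base R sketch of the same grid-and-join idea that also fills in zeros might look like this; it assumes the six example rows, and the names `obs`, `counts`, `grid`, and `full` are illustrative rather than taken from the answer.

```r
# Example rows from the question, with the hour extracted as an integer
obs <- data.frame(
  DATE = as.Date(c("2014-02-15", "2014-02-15", "2014-04-15",
                   "2014-05-15", "2014-06-15", "2014-06-15")),
  HOUR = as.integer(substr(c("15:02", "15:12", "02:02",
                             "11:02", "15:42", "16:02"), 1, 2))
)

# Count rows per date/hour by summing a column of ones
obs$COUNT <- 1L
counts <- aggregate(COUNT ~ DATE + HOUR, data = obs, FUN = sum)

# Every day from the first to the last date, crossed with all 24 hours
grid <- expand.grid(
  DATE = seq(min(obs$DATE), max(obs$DATE), by = "day"),
  HOUR = 0:23
)

# Left join the counts onto the grid, then turn missing counts into zeros
full <- merge(grid, counts, by = c("DATE", "HOUR"), all.x = TRUE)
full$COUNT[is.na(full$COUNT)] <- 0L
full <- full[order(full$DATE, full$HOUR), ]
head(full)
```

The result has one row for each of the 24 hours on every calendar day in the range, including days that do not appear in the input at all.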

Source

2015-10-05 15:18:25 davechilders

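A couple more lines with `tidyr::expand` and a `left_join` in the mix would also address the OP's edit request. – hrbrmstr

Another option is the following. First, you create an hour column in mutate(). Then you count how many data points exist for each DATE and hour in count(). Once the data is ungrouped, you can join the two data frames to create the result you want. The expand.grid() part creates all combinations of DATE and hour (00 to 23). Since the hours are zero-padded (02 for 2, for example), I used c(paste0("0", 0:9), 10:23). Finally, NA is replaced with 0 in the last mutate().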

library(dplyr) 
library(stringi) 
library(data.table) 

mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>% 
count(DATE, hour) %>% 
ungroup %>% 
right_join(expand.grid(DATE = unique(.$DATE), 
         hour = c(paste0("0", 0:9), 10:23))) %>% 
mutate(n = replace(n, is.na(n), 0)) 

# A bit of outcome 
#   DATE hour n 
#1 2014-02-15 00 0 
#2 2014-04-15 00 0 
#3 2014-05-15 00 0 
#4 2014-06-15 00 0 
#5 2014-02-15 01 0

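With data.table, you can do the same operation. You create an hour column and count the number of data points by DATE and hour. Then you merge temp with a data table that has all combinations of DATE and hour (00 to 23), which you can create with CJ(). Once the merge is done, you replace NA with 0 in the count column (total).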

setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][, 
      list(total = .N), by = list(DATE, hour)] -> temp 

merge(temp, 
     CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)), 
     by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][] 

#   DATE hour total 
# 1: 2014-02-15 02  0 
# 2: 2014-02-15 11  0 
# 3: 2014-02-15 15  2 
# 4: 2014-02-15 16  0 
# 5: 2014-02-15 00  0

DATA

mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205, 
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L, 
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42", 
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE", 
"TIME"), row.names = c(NA, -6L))

Source

2015-10-05 16:11:05 jazzurro
