2012-01-11 39 views
2

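Grouping values into brackets

I have a question about bucketing data into specific categories.

Generally, if I have a factor variable, I do something like the following to recode the data into my preferred buckets: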

# Start from an all-NA character vector of the same length as educ2
educ <- rep(NA_character_, length(educ2))
educ[educ2 %in% levels(educ2)[c(5,8)]] <- "HS or Some College"
educ[educ2 %in% levels(educ2)[2:3]] <- "College Degree"
educ[educ2 %in% levels(educ2)[c(4,6)]] <- "Advanced Degree"
educ[educ2 %in% levels(educ2)[c(1,7,9)]] <- NA
educ <- factor(educ)

However, I am struggling to regroup a factor variable, time, which has 10,000+ levels.

> levels(wj$time) 
    [1] "0:00:05" "0:00:07" "0:00:08" "0:00:10" "0:00:13" "0:00:15" "0:00:18" "0:00:23" "0:00:31" "0:00:34" "0:00:36" 
    [12] "0:00:39" "0:00:41" "0:00:47" "0:00:48" "0:00:54" "0:00:55" "0:00:56" "0:00:59" "0:01:01" "0:01:02" "0:01:03" 
    [23] "0:01:13" "0:01:17" "0:01:31" "0:01:33" "0:01:41" "0:01:44" "0:01:48" "0:01:50" "0:01:52" "0:01:53" "0:01:55" 
    [34] "0:02:08" "0:02:12" "0:02:13" "0:02:21" "0:02:26" "0:02:27" "0:02:30" "0:02:32" "0:02:33" "0:02:36" "0:02:37" 
    [45] "0:02:38" "0:02:43" "0:02:45" "0:02:53" "0:02:56" "0:03:07" "0:03:15" "0:03:19" "0:03:21" "0:03:22" "0:03:24" 
    [56] "0:03:30" "0:03:36" "0:03:39" "0:03:41" "0:03:49" "0:03:56" "0:03:59" "0:04:02" "0:04:04" "0:04:07" "0:04:10" 
    [67] "0:04:11" "0:04:12" "0:04:14" "0:04:16" "0:04:17" "0:04:19" "0:04:22" "0:04:27" "0:04:28" "0:04:30" "0:04:37" 
    [78] "0:04:39" "0:04:41" "0:04:49" "0:04:51" "0:04:52" "0:04:53" "0:04:54" "0:05:05" "0:05:06" "0:05:20" "0:05:22" 

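The data are structured as shown above. I just don't know how to quickly bucket the data into specific brackets when there are so many factor levels. I would like to group them into 0:12:00 to 0:05:00, 0:05:01 to 0:10:00, and so on. With so many factor levels, I am a bit lost on how to determine where each group should start and end. Can anyone help? With 10,000+ levels, the way I traditionally do this becomes a problem.

Thanks!

Answers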

4

You can split the timestamp into its components; the buckets are then very easy to compute.

# Sample data 
n <- 10 
d <- data.frame(
    time = paste(
        sample(0:23, n, replace=TRUE),
        sample(0:59, n, replace=TRUE),
        sample(0:59, n, replace=TRUE),
        sep=":"
    ),
    value = rnorm(n)
)

# Split the "time" column into its components 
d$time <- as.character(d$time) 
times <- strsplit(d$time, ":") 
times <- lapply(times, as.numeric) 
times <- do.call(rbind, times) 
colnames(times) <- c("hour", "minute", "second") 
d <- cbind(times, d) 

# Build the buckets 
d$bucket <- paste(
    sprintf("%02d:%02d:00", d$hour, floor(d$minute/5) * 5), 
    sprintf("%02d:%02d:59", d$hour, floor(d$minute/5) * 5 + 4), 
    sep=" to " 
) 
1

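The problem is that you have what is effectively a continuous variable, written in a particular character format and stored as a factor. A factor is not appropriate here, because the levels only represent the values that happen to appear in the data, not a predefined set of possible values. It is a character vector only because that is a common convention for writing a formatted data type, namely time. I would guess it is hours:minutes:seconds, but given your example breaks it might be days(?):hours:minutes. If it is hours:minutes:seconds, then these times are best represented as times objects from the chron package. Once you do that, the question becomes how to categorize a continuous variable into discrete groups, and that is done with the cut function.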
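As a rough illustration of that idea, here is a minimal sketch that uses base R only (plain seconds instead of chron times objects, to avoid the extra dependency). It assumes the values are hours:minutes:seconds; the sample vector, the helper names to_seconds and fmt, and the five-minute width are made up for the example.

```r
# Hypothetical sample of "H:MM:SS" strings stored as a factor
time <- factor(c("0:00:05", "0:04:59", "0:05:00", "0:05:01", "0:12:30", "1:02:03"))

# Convert each string to seconds since 0:00:00 (assumes hours:minutes:seconds)
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}
secs <- to_seconds(time)

# Five-minute break points covering the observed range
width <- 5 * 60
breaks <- seq(0, ceiling((max(secs) + 1) / width) * width, by = width)

# Format a number of seconds back into "H:MM:SS" for readable labels
fmt <- function(s) sprintf("%d:%02d:%02d", s %/% 3600, (s %% 3600) %/% 60, s %% 60)
labels <- paste(fmt(head(breaks, -1)), "to", fmt(breaks[-1] - 1))

# right = FALSE makes each bucket [start, end), so 0:05:00 opens a new bucket
bucket <- cut(secs, breaks = breaks, right = FALSE, labels = labels)
table(bucket)[table(bucket) > 0]
```

Because the break points are computed from the range of the data, nothing here depends on how many distinct factor levels there are.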

0

Combining the answers and code from @Brian Diggs and @Vincent Zoonekynd, I would recommend looking at a few functions:

?strptime 
?POSIXlt 
?cut.POSIXt 


#create factorized time vector within data frame 
n <- 10 
d <- data.frame(
    time = as.factor(paste(
        sample(0:23, n, replace=TRUE),
        sample(0:59, n, replace=TRUE),
        sample(0:59, n, replace=TRUE),
        sep=":"
    )),
    value = rnorm(n)
)

#convert to time format, then apply cuts per hour 
(d$time<- cut.POSIXt(strptime(d$time, format="%H:%M:%S"), breaks="hour")) 

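If you don't want hourly breaks, you can use "day" or something else. You can also check out this link for answers to your question; I found it by searching for "convert string to time".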
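For the five-minute brackets asked about in the question, one option is to pass explicit break points to cut rather than a unit name, so the buckets line up on 0:00:00, 0:05:00, and so on. The following is only a sketch under assumptions: the sample values are invented, and the times are anchored to an arbitrary date (1970-01-01) in UTC purely to sidestep daylight-saving effects.

```r
# Hypothetical sample of "H:MM:SS" strings stored as a factor
time <- factor(c("0:00:05", "0:04:59", "0:05:00", "0:12:30", "1:02:03"))

# Parse as date-times on an arbitrary fixed day in UTC (an assumption made
# only so that every value shares the same midnight and no DST shift applies)
t <- as.POSIXct(paste("1970-01-01", as.character(time)),
                format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Explicit break points every 5 minutes from midnight to the next midnight
origin <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
breaks <- seq(origin, by = "5 min", length.out = 24 * 12 + 1)

# Label each bucket by its start time; right = FALSE gives [start, end)
labels <- format(head(breaks, -1), "%H:%M:%S")
bucket <- cut(t, breaks = breaks, labels = labels, right = FALSE)
data.frame(time, bucket)
```

Supplying labels explicitly keeps the bucket names independent of how a given R version prints date-times.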

HTH。