2012-12-15 78 views

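Difficulty manipulating longitudinal data by row in R

I'm having some trouble working with longitudinal data: each row of my dataset contains a unique ID followed by a series of visit dates. Each visit has values for 3 dichotomous variables.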

data1 <- structure(list(V1date = structure(c(2L, 1L, 2L, 3L, 4L), .Label = c("1/22/12", "4/5/12", "8/18/12", "9/6/12"), class = "factor"), 
V1a = structure(c(1L, 1L, 2L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"), 
V1b = structure(c(2L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
V1c = structure(c(1L, 2L, 1L, 1L, 1L), .Label = c("No", "Yes"), class = "factor"), 
V2date = structure(c(1L, 2L, 4L, 3L, NA), .Label = c("6/18/12", "7/5/12", "9/22/12", "9/4/12"), class = "factor"), 
V2a = structure(c(1L, 1L, 1L, 1L, NA), .Label = "Yes", class = "factor"), 
V2b = structure(c(1L, 1L, 1L, 1L, NA), .Label = "No", class = "factor"), 
V2c = structure(c(1L, 1L, 1L, 1L, NA), .Label = "Yes", class = "factor"), 
V3date = structure(c(NA, NA, 1L, NA, 2L), .Label = c("11/1/12", "12/4/12"), class = "factor"), 
V3a = structure(c(NA, NA, 1L, NA, 1L), .Label = "Yes", class = "factor"), 
V3b = structure(c(NA, NA, 1L, NA, 1L), .Label = "No", class = "factor"), 
V3c = structure(c(NA, NA, 2L, NA, 1L), .Label = c("No", "Yes"), class = "factor")), 
.Names = c("V1date", "V1a", "V1b", "V1c", "V2date", "V2a", "V2b", "V2c", "V3date", "V3a", "V3b", "V3c"), 
class = "data.frame", row.names = c("001", "002", "003", "004", "005")) 

data1  
     V1date V1a V1b V1c  V2date  V2a  V2b  V2c  V3date  V3a  V3b  V3c
001  4/5/12  No Yes  No 6/18/12  Yes   No  Yes    <NA> <NA> <NA> <NA>
002 1/22/12  No  No Yes  7/5/12  Yes   No  Yes    <NA> <NA> <NA> <NA>
003  4/5/12 Yes  No  No  9/4/12  Yes   No  Yes 11/1/12  Yes   No  Yes
004 8/18/12  No  No  No 9/22/12  Yes   No  Yes    <NA> <NA> <NA> <NA>
005  9/6/12 Yes  No  No    <NA> <NA> <NA> <NA> 12/4/12  Yes   No   No

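Of the 8 possible combinations of the three variables, 4 are "abnormal" and the other 4 are "normal". Everyone starts out abnormal and then either stays abnormal at subsequent visits or resolves to a normal pattern at a later visit (I'm ignoring reversion to abnormal: once they are normal, they are normal).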

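Ultimately I need to add 4 new columns on the right of the data frame indicating 1) the date of the last completed visit (regardless of intervening NAs), 2) whether the ID eventually resolved or remained abnormal, 3) if resolved, what the resolution pattern was, and 4) what the date of resolution was. NAs always come in groups of 4 (i.e., no visit date and no values for the 3 variables) and are to be ignored. For example, if the patterns "Yes-Yes-No", "Yes-No-Yes", "No-Yes-Yes" and "Yes-Yes-Yes" are normal and the remaining patterns are abnormal, the result would have four columns added as follows: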

data2 <- structure(list(
LastVisDate = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("6/18/12", "12/4/12", "11/1/12", "9/22/12"), class = "factor"), 
Resolved = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("No", "Yes"), class = "factor"), 
Pattern = structure(c(1L, 1L, 1L, 1L, NA), .Label = "yny", class = "factor"), 
Resdate = structure(c(1L, 2L, 3L, 4L, NA), .Label = c("6/18/12", "7/5/12", "9/4/12", "9/22/12"), class = "factor")), 
.Names = c("LastVisDate", "Resolved", "Pattern", "Resdate"), 
class = "data.frame", row.names = c("001", "002", "003", "004", "005")) 

data2 
    LastVisDate Resolved Pattern Resdate
001     11/1/12      Yes     yny 6/18/12
002     12/4/12      Yes     yny  7/5/12
003     11/1/12      Yes     yny  9/4/12
004     11/1/12      Yes     yny 9/22/12
005     12/4/12       No    <NA>    <NA>

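I've spent a lot of time on this project but can't figure out how to ask R to proceed rightward through the dataset until my stopping rule is satisfied. Suggestions are much appreciated.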


Check out apply(), which applies a function to each row or each column of a data.frame, depending on whether you pass 1 or 2 to apply(). – tcash21


For clarity, it might help if you showed us what you expect the final data frame to look like for this small set of rows. – A5C1D2H2I1M1N2O1R2T1


data2 doesn't seem to match data1. For example, row 1 in data1 doesn't contain the date 7/5/12. –

Answers


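This relies on the structure of your data. In particular, starting at columns 2, 6 and 10 there are three values each that get passed to a function that determines whether someone is "normal".

Here is the function that determines whether someone is "normal". There are other ways to write this.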

# TRUE if the three values for one visit match any of the four "normal" patterns.
is.normal <- function(x) {
    any(c(
        all(x == c("Yes", "Yes", "No")),
        all(x == c("Yes", "No", "Yes")),
        all(x == c("No", "Yes", "Yes")),
        all(x == c("Yes", "Yes", "Yes"))
    ))
}
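
Since there are other ways to write this, here is one possible alternative, offered only as a sketch: collapse the three answers into a single key and look it up in a set of normal patterns. The names `normal.patterns` and `is.normal2` are hypothetical, and the NA handling assumes, as the question states, that a visit's values are either all present or all missing.

```r
# Hypothetical alternative to is.normal(): build one key per visit and test
# membership in the set of "normal" patterns.
normal.patterns <- c("Yes-Yes-No", "Yes-No-Yes", "No-Yes-Yes", "Yes-Yes-Yes")

is.normal2 <- function(x) {
    x <- as.character(x)
    if (anyNA(x)) return(NA)    # a missing visit stays NA
    paste(x, collapse = "-") %in% normal.patterns
}

is.normal2(c("Yes", "No", "Yes"))   # a normal pattern
is.normal2(c("No", "No", "Yes"))    # an abnormal pattern
is.normal2(c(NA, NA, NA))           # a skipped visit
```

Keeping the patterns in a vector makes it easier to change which combinations count as normal without touching the function body.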

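We apply this to the appropriate sets of columns. This depends on the exact layout you specified in the question; note the column numbers passed to vapply. The result is a logical matrix telling whether someone is "normal" at each visit.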

ok <- vapply(c(2,6,10), 
     function(x) apply(data1[x:(x+2)], 1, is.normal), 
     logical(length(data1[,1]))) 

> ok 
     [,1] [,2]  [,3]
001 FALSE TRUE    NA
002 FALSE TRUE    NA
003 FALSE TRUE  TRUE
004 FALSE TRUE    NA
005 FALSE   NA FALSE
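
The hard-coded column numbers 2, 6 and 10 could also be derived from the column names, which may help if more visits are added later. This is only a sketch and assumes every visit block starts with a column whose name ends in "date", as in the question; `nms`, `date.cols` and `first.val.cols` are illustrative names.

```r
# Column names as laid out in the question's data1.
nms <- c("V1date", "V1a", "V1b", "V1c",
         "V2date", "V2a", "V2b", "V2c",
         "V3date", "V3a", "V3b", "V3c")

date.cols <- grep("date$", nms)    # positions of the visit-date columns
first.val.cols <- date.cols + 1    # first of the three value columns per visit

date.cols
first.val.cols
```

With the real data frame, `names(data1)` would take the place of `nms`, and `first.val.cols` could be passed to vapply instead of `c(2,6,10)`.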

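Now find the first visit at which each person becomes "normal", if any. By inspection, that is visit 2 for everyone except the last person, who never becomes normal. The if guards against min returning Inf when normal is never reached.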

date.ind <- apply(ok, 1,
    function(x) {
        y <- which(x)
        if (length(y)) min(y) else NA
    }
)

> date.ind 
001 002 003 004 005
  2   2   2   2  NA
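
To illustrate why the guard is needed, and as one possible shortcut, the sketch below rebuilds the ok matrix by hand from the output shown earlier. `first.true` and `date.ind2` are hypothetical names. It relies on two documented behaviors: which() drops NA, and indexing an empty vector with [1] gives NA.

```r
# The logical matrix from the previous step, typed in by hand (column-wise).
ok <- matrix(c(FALSE, FALSE, FALSE, FALSE, FALSE,
               TRUE,  TRUE,  TRUE,  TRUE,  NA,
               NA,    NA,    TRUE,  NA,    FALSE),
             ncol = 3,
             dimnames = list(c("001", "002", "003", "004", "005"), NULL))

# min() of an empty vector is Inf (with a warning), hence the if in the answer.
empty.min <- suppressWarnings(min(which(ok["005", ])))

# which() returns positions in increasing order, so [1] is the first TRUE;
# when there is none, integer(0)[1] is NA.
first.true <- function(x) which(x)[1]
date.ind2 <- apply(ok, 1, first.true)
date.ind2
```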

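We can then extract the dates, since we know from the above in which "group" (visit) normal was reached and how to get from the group number to the actual date column: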

dates <- vapply(seq_along(date.ind),
    function(x) {
        if (is.na(date.ind[x])) as.character(NA)
        else as.character(data1[x, date.ind[x]*4-3])
    },
    character(1)
)
> dates 
[1] "6/18/12" "7/5/12" "9/4/12" "9/22/12" NA 

Extracting the other information is similar, since the column indices can be computed as above.
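
A minimal sketch of how the remaining columns might be assembled follows. It makes several assumptions: data1 is rebuilt here with plain character columns rather than factors, date.ind is typed in from the output above, the pattern code is taken to be the lowercase first letter of each value (as in "yny"), and the helper names (`m`, `rows`, `last.ind`, `pattern.at`, `result`) are illustrative.

```r
# The question's data, rebuilt as character columns (assumption: no factors).
data1 <- data.frame(
    V1date = c("4/5/12", "1/22/12", "4/5/12", "8/18/12", "9/6/12"),
    V1a = c("No", "No", "Yes", "No", "Yes"),
    V1b = c("Yes", "No", "No", "No", "No"),
    V1c = c("No", "Yes", "No", "No", "No"),
    V2date = c("6/18/12", "7/5/12", "9/4/12", "9/22/12", NA),
    V2a = c("Yes", "Yes", "Yes", "Yes", NA),
    V2b = c("No", "No", "No", "No", NA),
    V2c = c("Yes", "Yes", "Yes", "Yes", NA),
    V3date = c(NA, NA, "11/1/12", NA, "12/4/12"),
    V3a = c(NA, NA, "Yes", NA, "Yes"),
    V3b = c(NA, NA, "No", NA, "No"),
    V3c = c(NA, NA, "Yes", NA, "No"),
    row.names = c("001", "002", "003", "004", "005"),
    stringsAsFactors = FALSE
)
date.ind <- c(2L, 2L, 2L, 2L, NA)   # first "normal" visit per row, from above
date.cols <- c(1, 5, 9)             # visit-date columns
m <- as.matrix(data1)
rows <- seq_len(nrow(data1))

# Last completed visit: the highest-numbered visit with a non-missing date.
last.ind <- apply(!is.na(data1[date.cols]), 1, function(x) max(which(x)))
LastVisDate <- m[cbind(rows, date.cols[last.ind])]

Resolved <- ifelse(is.na(date.ind), "No", "Yes")

# Matrix indexing: an NA in an index row yields NA in the result.
Resdate <- m[cbind(rows, date.ind * 4 - 3)]

# Pattern at the resolving visit, e.g. Yes/No/Yes becomes "yny".
pattern.at <- function(i, g) {
    if (is.na(g)) return(NA_character_)
    vals <- m[i, (g * 4 - 2):(g * 4)]
    paste(tolower(substr(vals, 1, 1)), collapse = "")
}
Pattern <- mapply(pattern.at, rows, date.ind)

result <- data.frame(LastVisDate, Resolved, Pattern, Resdate,
                     row.names = rownames(data1), stringsAsFactors = FALSE)
result
```

The last-visit dates computed this way come from data1 itself, so they need not agree with the data2 listing in the question, which a commenter flagged as inconsistent with data1. `cbind(data1, result)` would append the four columns on the right, as the question asks.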