2012-09-14

Reduce rows to unique items

I have a data frame:

test <- structure(list(
    y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"), 
    y2003 = c("freshman","junior","junior","sophomore","sophomore","senior"), 
    y2004 = c("junior","sophomore","sophomore","senior","senior",NA), 
    y2005 = c("senior","senior","senior",NA, NA, NA)), 
       .Names = c("2002","2003","2004","2005"), 
       row.names = c(c(1:6)), 
       class = "data.frame") 
> test
       2002      2003      2004   2005
1  freshman  freshman    junior senior
2  freshman    junior sophomore senior
3  freshman    junior sophomore senior
4 sophomore sophomore    senior   <NA>
5 sophomore sophomore    senior   <NA>
6    senior    senior      <NA>   <NA>

I want to munge this time data so that each row keeps only its distinct steps, like this:

result <- structure(list(
y2002 = c("freshman","freshman","freshman","sophomore","sophomore","senior"), 
y2003 = c("junior","junior","junior","senior","senior",NA), 
y2004 = c("senior","sophomore","sophomore",NA,NA,NA), 
y2005 = c(NA,"senior","senior",NA, NA, NA)), 
       .Names = c("1","2","3","4"), 
       row.names = c(c(1:6)), 
       class = "data.frame") 

> result
          1      2         3      4
1  freshman junior    senior   <NA>
2  freshman junior sophomore senior
3  freshman junior sophomore senior
4 sophomore senior      <NA>   <NA>
5 sophomore senior      <NA>   <NA>
6    senior   <NA>      <NA>   <NA>

I know that if I took each row as a vector, I could do something like

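careerrow <- c(1,2,3,3,4) 
# pair each element with its successor, iterating over positions rather than values 
pairz <- lapply(seq_len(length(careerrow) - 1), function(i){c(careerrow[i],careerrow[i+1])}) 
# keep the first element plus every element that differs from its predecessor 
uniquepairz <- careerrow[c(TRUE, sapply(pairz,function(x){x[1]!=x[2]}))] 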

My difficulty is applying this row-wise to my data frame. I think lapply is the way to go, but so far I have not been able to work it out.

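Do you need the result to be a valid data.frame padded with NA values, or would a list associated with each id be enough? – mnel

I want to count identical rows, so I think having it as a valid data.frame would be good. Or would a list of lists make that kind of counting easier? – dmvianna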

Answers

3

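If your aim is to count the total for each pathway, you can use something like this. It uses data.table because it handles list elements inside a data.table (data.frame-like) object nicely.

I use !duplicated(...) to remove the duplicates because it is slightly more efficient than unique.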
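As a small illustration of that idiom, here is a minimal sketch on a made-up vector (`x` is hypothetical, not part of the question's data). It only shows that subsetting with `!duplicated(x)` keeps first occurrences in their original order, the same result `unique(x)` gives for a plain character vector.

x <- c("freshman", "junior", "junior", "senior", "freshman")   # hypothetical example vector
# duplicated() flags every value that already appeared earlier in the vector
dup <- duplicated(x)
# dropping the flagged positions keeps the first occurrence of each value, in order
steps_a <- x[!dup]
steps_b <- unique(x)
identical(steps_a, steps_b)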

library(data.table) 
library(reshape2) 
# make the rownames a column 
test$id <- rownames(test) 
# put in long format 
DT <- as.data.table(melt(test,id='id')) 
# get the unique steps and concatenate into a unique identifier for each pathway 
DL <- DT[!is.na(value), {.steps <- value[!duplicated(value)] 
    stepid <- paste(.steps, sep ='.',collapse = '.') 
    list(steps = list(.steps), stepid =stepid)}, by=id] 
##    id                            steps                           stepid
## 1:  1           freshman,junior,senior           freshman.junior.senior
## 2:  2 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
## 3:  3 freshman,junior,sophomore,senior freshman.junior.sophomore.senior
## 4:  4                 sophomore,senior                 sophomore.senior
## 5:  5                 sophomore,senior                 sophomore.senior
## 6:  6                           senior                           senior

# count the number per path 

DL[, .N, by = stepid] 
##                              stepid N
## 1:           freshman.junior.senior 1
## 2: freshman.junior.sophomore.senior 2
## 3:                 sophomore.senior 2
## 4:                           senior 1

+1 Nice example (the first, I think) of `list` column output (the `steps` column) printed with the pretty concatenation (new in 1.8.2). –

2

lapply, when passed a data.frame, operates on its columns. This is because a data.frame is a list whose elements are the columns. Instead of lapply, you can use apply with MARGIN=1:

unique.padded <- function(x) { 
    # keep first occurrences, then pad with NA back to the original length 
    uniq <- unique(x) 
    c(uniq, rep(NA, length(x) - length(uniq))) 
} 

t(apply(test, 1, unique.padded)) 

#   [,1]        [,2]     [,3]        [,4]
# 1 "freshman"  "junior" "senior"    NA
# 2 "freshman"  "junior" "sophomore" "senior"
# 3 "freshman"  "junior" "sophomore" "senior"
# 4 "sophomore" "senior" NA          NA
# 5 "sophomore" "senior" NA          NA
# 6 "senior"    NA       NA          NA
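To make the column-versus-row distinction concrete, here is a minimal sketch on a tiny made-up data frame (`df` and its columns `a` and `b` are hypothetical, chosen only for illustration). It shows lapply visiting columns and apply with MARGIN=1 visiting rows.

# hypothetical two-row data frame, used only for illustration
df <- data.frame(a = c("x", "y"), b = c("x", "z"), stringsAsFactors = FALSE)
# lapply() walks the list elements of a data.frame, i.e. its columns
by_column <- lapply(df, unique)
names(by_column)      # one result per column: "a" and "b"
# apply(df, 1, ...) converts df to a matrix and walks its rows instead
by_row <- apply(df, 1, function(r) length(unique(r)))
by_row                # one result per row: 1 distinct value in row 1, 2 in row 2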

Edit: I saw your comment about your end goal. I would do something like this:

table(sapply(apply(test, 1, function(x)unique(na.omit(x))), 
      paste, collapse = "_")) 

#           freshman_junior_senior freshman_junior_sophomore_senior
#                                1                                2
#                           senior                 sophomore_senior
#                                1                                2
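
One caveat, offered as a sketch rather than as part of either answer: unique() and !duplicated() drop every repeat of a label, whereas the pairwise comparison in the question only drops consecutive repeats. For the sample data the two give the same steps, but they would differ if a label could recur after a gap. Base R's rle() is one way to collapse only adjacent runs; the `career` vector below is hypothetical.

# hypothetical career in which a label recurs after a gap
career <- c("freshman", "sophomore", "sophomore", "freshman", NA)
# drop missing years first, as the answers above do
career <- career[!is.na(career)]
# unique() keeps only the first occurrence of each label
steps_unique <- unique(career)       # "freshman" "sophomore"
# rle() collapses runs of adjacent equal values, so a return to an
# earlier label is kept as a separate step
steps_runs <- rle(career)$values     # "freshman" "sophomore" "freshman"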