2017-08-30 45 views
3

我有一個數據框,其中有一些數據在行的某些元素中用逗號連接。一些看起來像:從數據框中聚合多列

df <- data.frame(
c(2012,2012,2012,2013,2013,2013,2014,2014,2014) 
,c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i") 
) 
names(df) <- c("year", "type") 

我想要得到它的形式dcast接近它前往,與去年A,B,C等爲列,在整個數據的頻率幀位於結果數據幀的單元中。我首先嚐試colsplitdf然後使用dcast之後,但似乎只有工作,如果我想聚合的其中一個層面,而不是所有。

df2 <- data.frame(df$year, colsplit(df$type, ',' , c('v1','v2','v3','v4','v5'))) 
df3 <- dcast(df2, df.year ~ v1) 

這一結果只給了我爲colsplit的第一級,而不是全部。我接近解決方案還是應該完全使用不同的方法?

回答

0

也許,像這樣的東西可以工作?

# extract unique values and years 
    vals <- unique(do.call(c, strsplit(x = as.vector(df$type), "[[:punct:]]"))) 
    years <- unique(df$year) 

# count 
    df4 <- data.frame(sapply(vals, (function(vl) {sapply(years, (function(ye){ 
     sum(do.call(c, strsplit(as.vector(df$type[df$year == ye]) , "[[:punct:]]")) == vl) 
    }))}))) 
    df4 <- cbind(years, df4) 
    df4 
#result 
    years a b c d e f g h i 
1 2012 2 1 3 2 1 1 0 0 0 
2 2013 4 1 1 1 0 0 0 0 0 
3 2014 1 1 0 0 1 0 2 1 1 
1

您已接近解決方案。你只需要一步。您需要在dcast之前的一列中提供melt的所有值。看例子。

require(reshape2) 

df <- data.frame(c(2012,2012,2012,2013,2013,2013,2014,2014,2014), 
       c("a,b,c","d,e,f","a,c,d,c","a,a,a","b","c,a,d","g","a,b,e","g,h,i")) 
names(df) <- c("year", "type") 
df 

df2 <- data.frame(df$year, colsplit(df$type, ',', c('v1','v2','v3','v4','v5'))) 
df2 

df3 <- melt(df2, id.vars = "df.year", na.rm = T) 
df3 

df4 <- dcast(df3[df3$value != "", ], df.year ~ value, fun.aggregate = length) 
df4 
3

下面是與strsplit分裂「類型」列與base R單行選項,然後設置list輸出作爲「年」的名字,stack到一個單一的data.frame並獲取頻率算上使用table

table(stack(setNames(strsplit(as.character(df$type), ","), df$year))[2:1]) 
#  values 
#ind a b c d e f g h i 
# 2012 2 1 3 2 1 1 0 0 0 
# 2013 4 1 1 1 0 0 0 0 0 
# 2014 1 1 0 0 1 0 2 1 1 
1

這裏有一個data.table方法:

library(data.table) 
setDT(df) 
dcast(df[, .(unlist(strsplit(as.character(type), ",", fixed=TRUE))), by = year], 
year ~ V1, value.var = "V1", fun.aggregate = length) 
# year a b c d e f g h i 
#1: 2012 2 1 3 2 1 1 0 0 0 
#2: 2013 4 1 1 1 0 0 0 0 0 
#3: 2014 1 1 0 0 1 0 2 1 1 

我們˚F首先按逗號和每年分組將類型列分成長格式,然後將dcastlength一起作爲集合函數進行擴展。