Reshaping data (a faster way)

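Today I came across a table of frequencies that I had to expand into a data frame of raw data. I was able to do it, but I am wondering whether there is a faster way using the reshape package or data.table?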

The original table looks like this:

   i1 i2 i3 i4  m  f
1   0  0  0  0 22 29
2   1  0  0  0 30 50
3   0  1  0  0 13 15
4   0  0  1  0  1  6
5   1  1  0  0 24 67
6   1  0  1  0  5 12
7   0  1  1  0  1  2
8   1  1  1  0 10 22
9   0  0  0  1 10  7
10  1  0  0  1 27 30
11  0  1  0  1 14  4
12  0  0  1  1  1  0
13  1  1  0  1 54 63
14  1  0  1  1  8 10
15  0  1  1  1  8  6
16  1  1  1  1 57 51

Here is the data via dput so it is easy to grab:

dat <- structure(list(i1 = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 
0L, 0L, 1L, 1L, 0L, 1L), i2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 
0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L), i3 = c(0L, 0L, 0L, 1L, 0L, 1L, 
1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L), i4 = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), m = c(22L, 30L, 
13L, 1L, 24L, 5L, 1L, 10L, 10L, 27L, 14L, 1L, 54L, 8L, 8L, 57L 
), f = c(29L, 50L, 15L, 6L, 67L, 12L, 2L, 22L, 7L, 30L, 4L, 0L, 
63L, 10L, 6L, 51L)), .Names = c("i1", "i2", "i3", "i4", "m", 
"f"), class = "data.frame", row.names = c(NA, -16L))

My approach(es) to reshaping the data (is there a faster way?):

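#step 1: method 1 (in this case binding and stacking uses less code than reshape) 
#sex is made a factor explicitly, since R >= 4.0 no longer converts strings to factors by default 
dat2 <- data.frame(rbind(dat[,1:4], dat[, 1:4]), 
    sex = factor(rep(c('m', 'f'), each=nrow(dat))), 
    n = c(dat$m, dat$f)) 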
dat2 

#step 1: method 2  
dat3 <- reshape(dat, direction = "long", idvar = 1:4, 
    varying = list(c("m", "f")), 
    v.names = c("n"), 
    timevar = "sex", 
    times = c("m", "f")) 
rownames(dat3) <- 1:nrow(dat3) 
dat3 <- data.frame(dat3) 
dat3$sex <- as.factor(dat3$sex) 

all.equal(dat3, dat2) #just to show both method 1 and 2 give the same data frame 

#step 2 
dat4 <- dat2[rep(seq_len(nrow(dat2)), dat2$n), 1:5] 
rownames(dat4) <- 1:nrow(dat4) 
dat4

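I believe this is a common problem, because when you want to take a table from an article and reproduce it, some unpacking is required. I find myself doing this more and more, and I want to make sure I am being efficient.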

Source

2012-03-30 Tyler Rinker

Here is a one-liner.

library(plyr) 
dat2 <- ddply(dat, 1:4, summarize, sex = c(rep('m', m), rep('f', f)))

Source

2012-03-30 01:48:16 Ramnath

I would use melt for the first step and ddply for the second.

library(reshape2) 
library(plyr) 
d <- ddply( 
    melt(dat, id.vars=c("i1","i2","i3","i4"), variable.name="sex"), 
    c("i1","i2","i3","i4","sex"), 
    summarize, 
    id=rep(1,value) 
) 
d$id <- cumsum(d$id)

Source

2012-03-30 01:26:38

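I like this far better than my approach. If no one comes up with a more efficient method (less code, not speed), I will mark this as the correct answer. +1 – 2012-03-30 01:33:56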

I think this is the correct one. I don't think anyone can beat this amount of code. – 2012-03-30 01:44:55

Check again :-) – Ramnath 2012-03-30 01:56:07

Here is a base R one-liner.

dat2 <- cbind(dat[c(rep(1:nrow(dat), dat$m), rep(1:nrow(dat), dat$f)),1:4], 
       sex=c(rep("m",sum(dat$m)), rep("f", sum(dat$f))))

Or, a little more general:

d1 <- dat[,1:4] 
d2 <- as.matrix(dat[,5:6]) 
dat2 <- cbind(d1[rep(rep(1:nrow(dat), ncol(d2)), d2),], 
       sex=rep(colnames(d2), colSums(d2)))
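A quick way to convince yourself that an expansion like this is right is to tabulate the result back into counts and compare with the original table. The following is only a sketch of such a round-trip check, not part of the answer itself; the helper `key` and the object `counts` are names made up here, and `dat` is rebuilt from the values in the question so the snippet runs on its own.

```r
# Frequency table from the question
dat <- data.frame(
  i1 = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L),
  i2 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L),
  i3 = c(0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L),
  i4 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  m  = c(22L, 30L, 13L, 1L, 24L, 5L, 1L, 10L, 10L, 27L, 14L, 1L, 54L, 8L, 8L, 57L),
  f  = c(29L, 50L, 15L, 6L, 67L, 12L, 2L, 22L, 7L, 30L, 4L, 0L, 63L, 10L, 6L, 51L)
)

# The more general base R expansion from above
d1 <- dat[, 1:4]
d2 <- as.matrix(dat[, 5:6])
dat2 <- cbind(d1[rep(rep(seq_len(nrow(dat)), ncol(d2)), d2), ],
              sex = rep(colnames(d2), colSums(d2)))

# Hypothetical helper: one string per i1..i4 combination
key <- function(df) paste(df$i1, df$i2, df$i3, df$i4)

# Count expanded rows per combination and sex, in the original row order
counts <- table(factor(key(dat2), levels = key(dat)),
                factor(dat2$sex, levels = c("m", "f")))

# The expansion should have one row per counted individual,
# and tabulating it should give back the original m and f columns
stopifnot(nrow(dat2) == sum(dat$m) + sum(dat$f))
stopifnot(all(counts[, "m"] == dat$m), all(counts[, "f"] == dat$f))
```

The same check should work for the output of any of the other answers, as long as the result has the columns i1 to i4 and sex.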

Source

2012-03-30 02:24:30 Aaron

Nice base work +1 – 2012-03-30 02:54:53

Given that no one has posted a data.table solution (as suggested in the original question):

library(data.table) 
DT <- as.data.table(dat) 
DT[,list(sex = rep(c('m','f'),c(m,f))), by= list(i1,i2,i3,i4)]

Or, more concisely:

DT[,list(sex = rep(c('m','f'),c(m,f))), by= 'i1,i2,i3,i4']

Source

2012-10-03 04:31:17 mnel

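Can 'c(m, f)' (and 'list(i1, i2, i3, i4)') be modified to refer to a variable containing the column names? For example, instead of the m and f columns, what if I had 100 columns (say Var0 to Var99) and did not want to type out the name of each column? – dnlbrky 2013-07-06 15:46:09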
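One possible way to do what the comment above asks, sketched under a few assumptions: the count columns are selected with `.SDcols` and the grouping columns are passed to `by` as a character vector. The names `id_cols`, `count_cols` and `long` are made up for illustration, the data is a toy subset of the question's table (rows 1, 2 and 12), and the approach assumes each combination of grouping values occurs in exactly one row.

```r
library(data.table)

# Toy subset of the question's table (rows 1, 2 and 12), for illustration only
DT <- data.table(i1 = c(0L, 1L, 0L), i2 = c(0L, 0L, 0L),
                 i3 = c(0L, 0L, 1L), i4 = c(0L, 0L, 1L),
                 m  = c(22L, 30L, 1L), f  = c(29L, 50L, 0L))

# Column names held in variables instead of being typed into the call.
# With many count columns this could be e.g. paste0("Var", 0:99).
id_cols    <- c("i1", "i2", "i3", "i4")
count_cols <- setdiff(names(DT), id_cols)   # here "m" and "f"

# Within each group, .SD holds only the count columns (one row here),
# so unlist(.SD) gives the number of times to repeat each column name
long <- DT[, list(sex = rep(count_cols, unlist(.SD))),
           by = id_cols, .SDcols = count_cols]
```

If a combination of grouping values can appear in more than one row, `unlist(.SD)` would no longer line up with `count_cols`, so the counts would need to be summed per group first.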
