2016-11-29

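Repeated calculation based on a condition

What I want to do is simple. However, I am new to R and have not learned much about loops and functions, so I am not sure what the most efficient way to get the result is. Basically, I want to count the rows that meet my criteria and then divide one count by the other. Here is an example: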

df1 <- data.frame(
    Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076), 
    B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537), 
    C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662), 
    D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382) 
) 
df1 
#        Main         B         C         D
# 1  0.008900        NA  0.018100 -0.012700
# 2 -0.050667  0.034500  0.000000 -0.025995
# 3 -0.030379 -0.068300 -0.056197 -0.042930
# 4  0.066484 -0.052774  0.040794  0.057816
# 5  0.006439  0.014661  0.035160  0.033458
# 6 -0.026076 -0.040537 -0.022662 -0.058382

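My criterion for the numerator is the number of rows where B/C/D > 0 and Main > 0; for the denominator, count the rows where B/C/D != 0 and Main != 0. I can use length(which(df1$Main >0 & df1$B>0))/length(which(df1$Main !=0 & df1$B !=0)) to get the ratio for each column separately. However, my dataset has many more columns, and I would like to know whether there is a way to get all of the ratios at once, so that the result looks like this: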

# B   C   D 
# 1 0.2  0.6  0.3 
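
To make the two counts concrete, here is a minimal sketch for column B alone, using the example data above. The names num_rows and den_rows are made up for illustration. Row 1, where B is NA, drops out of both counts because which() ignores NA:

```r
# Main and B copied from the example data above
Main <- c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076)
B    <- c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537)

# Numerator: rows where both Main and B are positive
num_rows <- which(Main > 0 & B > 0)
# Denominator: rows where both Main and B are nonzero
den_rows <- which(Main != 0 & B != 0)

num_rows                             # 5
den_rows                             # 2 3 4 5 6
length(num_rows) / length(den_rows)  # 0.2
```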

Answers

2

Use apply:

apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0))/length(which(df1$Main !=0 & x !=0))) 

1

criteria1 <- df1[which(df1$Main > 0), -1] > 0 
criteria2 <- df1[which(df1$Main != 0), -1] != 0 
colSums(criteria1, na.rm = TRUE)/colSums(criteria2, na.rm = TRUE) 
##   B   C   D 
## 0.2000000 0.6000000 0.3333333 
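
For anyone wondering why this works, here is a rough breakdown on the same example data. Comparing a data frame with > 0 gives a logical matrix, and colSums() counts the TRUE values in each column, with na.rm = TRUE skipping the NA in B. The names num and den are only for illustration:

```r
# Example data from the question
df1 <- data.frame(
  Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
  B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
  C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
  D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)

# Rows with Main > 0 (rows 1, 4, 5), all columns except Main, tested for > 0
criteria1 <- df1[which(df1$Main > 0), -1] > 0
# Rows with Main != 0 (all six rows here), tested for != 0
criteria2 <- df1[which(df1$Main != 0), -1] != 0

num <- colSums(criteria1, na.rm = TRUE)  # B = 1, C = 3, D = 2
den <- colSums(criteria2, na.rm = TRUE)  # B = 5, C = 5, D = 6
num / den
```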

Edit: Niek's method appears to be the fastest for this particular data:

# Unit: microseconds
#             expr     min       lq     mean   median       uq      max neval
#         Jim(df1) 216.468 230.0585 255.3755 239.8920 263.6870  802.341   300
#  emilliman5(df1) 120.109 135.5510 155.9018 142.4615 156.0135 1961.931   300
#        Niek(df1)  97.118 107.6045 123.5204 111.1720 119.6155 1966.830   300
#      nine89(df1) 211.683 222.6660 257.6510 232.2545 252.6570 2246.225   300
# [[1]]
#           [,1]    [,2]     [,3]    [,4]
# median 239.892 142.462  111.172 232.255
# ratio    1.000   0.594    0.463   0.968
# diff     0.000 -97.430 -128.720  -7.637

However, when there are many columns, the vectorized approaches are faster.

Nrow <- 1000 
Ncol <- 1000 
mat <- matrix(runif(Nrow*Ncol),Nrow) 
df1 <- data.frame(Main = sample(-2:2, Nrow, replace = TRUE), mat) # 1001 columns 

# Unit: milliseconds
#             expr      min       lq      mean    median        uq      max
#         Jim(df1) 46.75627 53.88500  66.93513  56.58143  62.04375 185.0460
#  emilliman5(df1) 73.35257 91.87283 151.38991 178.53188 185.06860 292.5571
#        Niek(df1) 68.17073 76.68351  89.51625  80.14190  86.45726 200.7119
#      nine89(df1) 51.36117 56.79047  74.53088  60.07220  66.34270 191.8294

# [[1]]
#          [,1]    [,2]   [,3]   [,4]
# median 56.581 178.532 80.142 60.072
# ratio   1.000   3.155  1.416  1.062
# diff    0.000 121.950 23.560  3.491
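
The call that produced these timings is not shown. The expr/neval layout looks like output from the microbenchmark package, so a comparison could be set up roughly as below. This is only a sketch under that assumption: the seed, the smaller column count, times = 20, and the names big, by_colsums and by_loop are illustrative choices, and the timing step is skipped if the package is not installed.

```r
set.seed(1)                # arbitrary seed, not from the original post
Nrow <- 1000
Ncol <- 50                 # fewer columns than above so the sketch runs quickly
big <- data.frame(Main = sample(-2:2, Nrow, replace = TRUE),
                  matrix(runif(Nrow * Ncol), Nrow))

# Vectorized variant: logical matrices plus colSums()
by_colsums <- function(d) {
  colSums(d[which(d$Main > 0), -1] > 0, na.rm = TRUE) /
    colSums(d[which(d$Main != 0), -1] != 0, na.rm = TRUE)
}

# Loop variant: one ratio per column, with a preallocated result vector
by_loop <- function(d) {
  out <- numeric(ncol(d) - 1)
  for (i in 2:ncol(d)) {
    out[i - 1] <- length(which(d$Main > 0 & d[, i] > 0)) /
      length(which(d$Main != 0 & d[, i] != 0))
  }
  out
}

# Confirm both variants agree before timing them
stopifnot(isTRUE(all.equal(unname(by_colsums(big)), by_loop(big))))

# Time both variants only if microbenchmark is available
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(by_colsums(big), by_loop(big), times = 20))
}
```

Timings depend on the machine and on the shape of the data, so the numbers will not match the tables above.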

Functions

Jim <- function(df1){ 
    criteria1 <- df1[which(df1$Main > 0), -1] > 0 
    criteria2 <- df1[which(df1$Main != 0), -1] != 0 
    colSums(criteria1, na.rm = TRUE)/colSums(criteria2, na.rm = TRUE) 
} 


emilliman5 <- function(df1){ 
    apply(df1[,-1], 2, function(x) length(which(df1$Main >0 & x>0))/length(which(df1$Main !=0 & x !=0))) 
} 

Niek <- function(df1){ 
    ratio1<-vector() 
    for(i in 2:ncol(df1)){ 
        ratio1[i-1] <- length(which(df1$Main >0 & df1[,i]>0))/length(which(df1$Main !=0 & df1[,i] !=0)) 
    } 
    ratio1 
} 

nine89 <- function(df){ 
    tail(colSums(df[df$Main>0,]>0, na.rm = TRUE)/colSums(df[df$Main!=0,]!=0, na.rm = TRUE), -1) 
} 

1

One way to do this is with a for loop that iterates over the columns and applies the expression you wrote. Something like this:

ratio1<-vector() 
for(i in 2:ncol(df1)){ 
    ratio1[i-1] <- length(which(df1$Main >0 & df1[,i]>0))/length(which(df1$Main !=0 & df1[,i] !=0)) 
} 

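There may be a better way to do this with apply or data.table, but this is the simplest solution I could come up with. It works for any number of columns. If you want the answers rounded to one decimal place, use round().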
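
As a sketch of the round() suggestion: the variant below swaps the for loop for vapply() (a substitution, not part of the answer above) so that the result keeps the column names. ratio_for is a hypothetical helper name.

```r
# Example data from the question
df1 <- data.frame(
  Main = c(0.0089, -0.050667, -0.030379, 0.066484, 0.006439, -0.026076),
  B = c(NA, 0.0345, -0.0683, -0.052774, 0.014661, -0.040537),
  C = c(0.0181, 0, -0.056197, 0.040794, 0.03516, -0.022662),
  D = c(-0.0127, -0.025995, -0.04293, 0.057816, 0.033458, -0.058382)
)

# Ratio for one column x against the reference column main
ratio_for <- function(x, main) {
  length(which(main > 0 & x > 0)) / length(which(main != 0 & x != 0))
}

# Apply to every column except Main; vapply keeps the names B, C, D
ratios <- vapply(df1[-1], ratio_for, numeric(1), main = df1$Main)
round(ratios, 1)
#   B   C   D
# 0.2 0.6 0.3
```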

2

You can do this in a vectorized way (no apply or for needed):

tail(colSums(df1[df1$Main>0,]>0, na.rm = TRUE)/colSums(df1[df1$Main!=0,]!=0, na.rm = TRUE), -1) 

#  B   C   D 
#0.2000000 0.6000000 0.3333333