2016-02-25

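R: loop through a data.frame and get counts of a variable

I have a data.frame with two columns, an identifier and a result. I need to go through the data.frame and count how many times each unique identifier occurs, as well as how many times each result occurs for it. The result column can take three possible values: Positive, Negative, or Ambiguous. So, for example, if there are 10 "RVP PCR" identifiers, I need to create a row with four columns, "Count", "Positive", "Negative", and "Ambiguous", each holding the number of times it occurred. In the example with 10 "RVP PCR" identifiers, the output row should show the identifier followed by a count of 10, then 7 negative, 1 positive, and 2 ambiguous. How do you accomplish this in R?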

> str(foo) 
'data.frame': 51 obs. of 2 variables: 
$ identifier: Factor w/ 99 levels "ADENOPCR","ALB-BF",..: 51 51 56 56 57 57 57 57 18 18 ... 
$ result : Factor w/ 3 levels "Ambiguous","Negative",..: 2 1 2 1 2 1 2 1 2 1 ... 



> dput(foo) 
    structure(list(identifier = structure(c(80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 
80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 80L, 64L, 
18L, 18L, 76L, 76L, 76L, 70L, 70L, 70L, 70L, 71L, 64L, 77L, 77L, 
77L, 77L, 77L, 77L, 77L, 77L, 76L), .Label = c("ADENOPCR", "ALB-BF", 
"ASPERAG", "ASPERAGB", "BDGLUCAN", "BLASTO", "BORD PCR", "BPERT", 
"CMV QNT", "CMVPCR", "COCCI", "COCCI G/M", "COCCI PAN", "COCCI-PPT", 
"CPNEUMOPCR", "CRP", "CRY BLD", "CWP-KOH", "DIFF CONF", "EBV PAN", 
"EBV PAN 2", "EBV QNT", "EXCEPT", "EXCEPT TT", "FLUFAC", "FUNG PKG", 
"FUNGSEQ", "GLU-FL", "HERP I", "HHV6PCR", "HISTO", "HISTO PPT", 
"HISTOAG S", "HISTOGM U", "HMPVFA", "HMPVPCR", "HSVPCR", "LEGAG-U", 
"LEGIONFA", "LEGIONPCR", "MA AFB", "MA FUNGAL", "MA MIC", "MA MTBPRIM", 
"MC AFB", "MC AFBID", "MC AFBR", "MC BAL", "MC BLD", "MC CYST", 
"MC FUNG", "MC FUNGID", "MC Legion", "MC LEGION", "MC MTD", "MC NOC", 
"MC RESP", "MC STAPH", "MC Strep", "MC STREP", "MC VRE", "MC W", 
"MICROSEQ", "MPNEUMOPCR", "MS CWP", "MTBRIF PCR", "MYCO-M", "NG REPORT", 
"ORGSEQ", "PARAFLUPCR", "PCP PCR", "PNEUMO AB", "PNEUMST", "PNEUMST R", 
"RESPMINI", "RESPMINI ", "RSPFA", "RSPFAC", "RSV", "RVP PCR", 
"RVPPCR", "SPN AG", "TP-FL", "V CMVC", "V FLUC", "V HSVC", "V HSVCT", 
"V RESPC", "V Urea", "V VIC", "V VIC R", "V VIRAL", "V VIRAL N", 
"V VIRAL R", "V VZV", "VDRL CSF", "VZVFAC", "VZVPCR", "WNILE PCR" 
), class = "factor"), result = structure(c(2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 
2L, 2L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Ambiguous", 
"Negative", "Positive"), class = "factor")), .Names = c("identifier", 
"result"), row.names = 1500:1550, class = "data.frame") 

Answers

2
library(dplyr) 
library(tidyr) 
foo %>% 
    group_by(identifier, result) %>% 
    summarise(n = n()) %>% 
    spread(key = result, value = n, drop = FALSE, fill = 0) %>% 
    mutate(Total = Ambiguous + Negative + Positive) %>% 
    filter(Total > 0) 

Result

Source: local data frame [7 x 5] 
Groups: identifier [7] 

  identifier Ambiguous Negative Positive Total
      (fctr)     (dbl)    (dbl)    (dbl) (dbl)
1    CWP-KOH         0        2        0     2
2 MPNEUMOPCR         0        0        2     2
3 PARAFLUPCR         0        3        1     4
4    PCP PCR         0        0        1     1
5  RESPMINI          0        4        0     4
6      RSPFA         0        7        1     8
7    RVP PCR         0       28        2    30
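
For readers on a newer tidyr, where `spread()` is superseded, the same reshaping can be sketched with `pivot_wider()`. This is only a sketch under assumptions: it needs dplyr 0.8 or later (for `.drop = FALSE` in `count()`) and tidyr 1.0 or later, and the small `foo` below is a made-up stand-in for the real data, with `result` keeping all three levels.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the question's foo: same two columns,
# result keeps the unused "Ambiguous" level.
foo <- data.frame(
  identifier = factor(c("RVP PCR", "RVP PCR", "RVP PCR", "RSPFA", "RSPFA")),
  result = factor(c("Negative", "Positive", "Negative", "Negative", "Positive"),
                  levels = c("Ambiguous", "Negative", "Positive"))
)

wide <- foo %>%
  # .drop = FALSE keeps identifier/result combinations with zero rows
  count(identifier, result, .drop = FALSE) %>%
  # one column per result level; missing cells become 0 rather than NA
  pivot_wider(names_from = result, values_from = n,
              values_fill = list(n = 0)) %>%
  mutate(Count = Ambiguous + Negative + Positive)

print(wide)
```

On the real data, where `identifier` carries 99 levels, calling `droplevels()` on that column first (or filtering `Count > 0` afterwards, as above) keeps only the identifiers that actually occur.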
+0

For this particular case, I can see a lot of sense in the `fill = 0` part. No counts (0) is conceptually different from no record (NA). – PavoDive

+0

This is exactly what I was looking for, thank you! @Thierry – Nodedeveloper101

2

I'm not entirely sure what your desired output is, but you can reshape your data:

library(reshape2) 

dcast(foo, identifier~result, fun.aggregate= length) 

This produces:

  identifier Negative Positive
1    CWP-KOH        2        0
2 MPNEUMOPCR        0        2
3 PARAFLUPCR        3        1
4    PCP PCR        0        1
5  RESPMINI         4        0
6      RSPFA        7        1
7    RVP PCR       28        2

############## Edited to add ##############

With the data you provided, there is no way "RVP PCR" will produce the result you describe.
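
One quick way to check this is to tabulate the results for that identifier alone. The sketch below rebuilds only the "RVP PCR" rows as they appear in the `dput` output (30 rows, 28 Negative and 2 Positive); the name `rvp` is just an illustration, and on the real data the same `table()` call can be applied to `foo` directly.

```r
# "RVP PCR" rows reconstructed from the dput above (assumed stand-in for foo)
rvp <- data.frame(
  identifier = factor(rep("RVP PCR", 30)),
  result = factor(c(rep("Negative", 28), rep("Positive", 2)),
                  levels = c("Ambiguous", "Negative", "Positive"))
)

# count each result level for this identifier, including levels with no rows
counts <- table(rvp$result[rvp$identifier == "RVP PCR"])
print(counts)
```

This gives 0 Ambiguous, 28 Negative, and 2 Positive, so the 10 / 7 / 1 / 2 row in the question can only be an illustration, not something derived from this sample.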

1

Without additional packages, you can do:

xtabs(~ identifier + result, data=droplevels(foo)) 

This gives the result:

> xtabs(~ identifier + result, data=droplevels(foo)) 
            result
identifier   Negative Positive
  CWP-KOH           2        0
  MPNEUMOPCR        0        2
  PARAFLUPCR        3        1
  PCP PCR           0        1
  RESPMINI          4        0
  RSPFA             7        1
  RVP PCR          28        2

If you want a data frame:

as.data.frame(unclass(xtabs(~ identifier + result, data=droplevels(foo)))) 

If you want the result in long format, you can also do:

foo$count <- 1 
aggregate(count ~ identifier+result, data=foo, FUN=length) 
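
To get closer to the row layout asked for in the question (identifier, Count, then one column per result), a base R sketch could look like the following. It is not part of the original answer. The small `foo` is a made-up stand-in, and `out` is just an illustrative name. Only `identifier` has its unused levels dropped, so an all-zero "Ambiguous" column is kept.

```r
# Hypothetical stand-in for the question's foo
foo <- data.frame(
  identifier = factor(c("RVP PCR", "RVP PCR", "RVP PCR", "RSPFA", "RSPFA")),
  result = factor(c("Negative", "Positive", "Negative", "Negative", "Positive"),
                  levels = c("Ambiguous", "Negative", "Positive"))
)

# drop unused identifiers, but keep all three result levels
foo$identifier <- droplevels(foo$identifier)

# identifier x result contingency table
tab <- table(foo$identifier, foo$result)

# one row per identifier: total count, then the per-result counts
out <- data.frame(
  identifier = rownames(tab),
  Count = as.vector(rowSums(tab)),
  as.data.frame.matrix(tab),
  row.names = NULL
)

print(out)
```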
+0

Great base solution, but might it be useful to have the result as a data frame? – PavoDive

+0

@PavoDive I edited my answer to include a data frame variant. If you like my base solution, you can upvote it. – jogo

0
library(tidyr) 
library(dplyr) 

foo %>% 
    count(identifier, result) %>% 
    spread(result, n) # or spread(result, n, fill = 0, drop = FALSE) 

#   identifier Negative Positive
#       (fctr)    (int)    (int)
# 1    CWP-KOH        2       NA
# 2 MPNEUMOPCR       NA        2
# 3 PARAFLUPCR        3        1
# 4    PCP PCR       NA        1
# 5  RESPMINI         4       NA
# 6      RSPFA        7        1
# 7    RVP PCR       28        2
1

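The data are in long format. First use the dcast command from the reshape2 library to convert them to wide format. Then add a Count column holding the sum of each row.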

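library(reshape2)  
# count rows per identifier/result pair (length is also dcast's fallback aggregator)
widedata <- dcast(foo, identifier ~ result, fun.aggregate = length) 
# sum every result column (all but identifier); a fixed [,2:4] fails when no "Ambiguous" column exists, as in the sample
widedata$Count <- rowSums(widedata[, -1, drop = FALSE]) 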