2016-08-24 55 views
1

How to perform pairwise division within groups of rows

I have a data frame created as follows:

df <- structure(list(celltype = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 
4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), .Label = c("Bcells", 
"DendriticCells", "Macrophages", "Monocytes", "NKCells", "Neutrophils", 
"StemCells", "StromalCells", "abTcells", "gdTCells"), class = "factor"), 
    sample = c("SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated", "SP ID control", "SP ID treated", "SP ID control", 
    "SP ID treated"), `mean(score)` = c(0.160953535029424, 0.155743474395545, 
    0.104788051104575, 0.125247035158472, -0.159665650045289, 
    -0.134662049979712, 0.196249441751866, 0.212256889027029, 
    0.0532668251890109, 0.0738264693971133, 0.151828478029596, 
    0.159941552142933, -0.14128323638966, -0.120556640790534, 
    0.196518649474078, 0.185264282171863, 0.0654641151966543, 
    0.0837989059507186, 0.145111577618456, 0.145448549866796)), .Names = c("celltype", 
"sample", "mean(score)"), row.names = c(7L, 8L, 17L, 18L, 27L, 
28L, 37L, 38L, 47L, 48L, 57L, 58L, 67L, 68L, 77L, 78L, 87L, 88L, 
97L, 98L), class = "data.frame") 

It looks like this:

> df 
     celltype  sample mean(score) 
7   Bcells SP ID control 0.16095354 
8   Bcells SP ID treated 0.15574347 
17 DendriticCells SP ID control 0.10478805 
18 DendriticCells SP ID treated 0.12524704 
27 Macrophages SP ID control -0.15966565 
28 Macrophages SP ID treated -0.13466205 
37  Monocytes SP ID control 0.19624944 
38  Monocytes SP ID treated 0.21225689 
47  NKCells SP ID control 0.05326683 
48  NKCells SP ID treated 0.07382647 
57 Neutrophils SP ID control 0.15182848 
58 Neutrophils SP ID treated 0.15994155 
67  StemCells SP ID control -0.14128324 
68  StemCells SP ID treated -0.12055664 
77 StromalCells SP ID control 0.19651865 
78 StromalCells SP ID treated 0.18526428 
87  abTcells SP ID control 0.06546412 
88  abTcells SP ID treated 0.08379891 
97  gdTCells SP ID control 0.14511158 
98  gdTCells SP ID treated 0.14544855 

What I want to do is compute, within each cell type, the ratio of the treated sample's score to the control sample's score.

The Excel image below illustrates the example. We are after the rightmost column. For Bcells, for example, the value is 0.1557/0.1610 ≈ 0.967.

[Image: Excel screenshot illustrating the calculation]

In the end, what I would like to obtain is a data frame that looks like this:

celltype   sample   Pairwise division 
Bcells    SP ID treated 0.967630031 
DendriticCells  SP ID treated 1.195241574 
Macrophages   SP ID treated 0.843400255 
Monocytes   SP ID treated 1.081566841 
NKCells    SP ID treated 1.385974647 
Neutrophils   SP ID treated 1.053435786 
StemCells   SP ID treated 0.853297563 
StromalCells  SP ID treated 0.942731303 
abTcells   SP ID treated 1.280073915 
gdTCells   SP ID treated 1.002322158 

How can I achieve this in R?

Answers

2

If you spread the data to wide form, this is pretty simple:

library(tidyr) 
library(dplyr) 

df %>% spread(sample, `mean(score)`) %>% 
    mutate(pairwise_division = `SP ID treated`/`SP ID control`) 

##   celltype SP ID control SP ID treated pairwise_division 
## 1   Bcells 0.16095354 0.15574347   0.9676300 
## 2 DendriticCells 0.10478805 0.12524704   1.1952416 
## 3  Macrophages -0.15966565 -0.13466205   0.8434003 
## 4  Monocytes 0.19624944 0.21225689   1.0815668 
## 5   NKCells 0.05326683 0.07382647   1.3859746 
## 6  Neutrophils 0.15182848 0.15994155   1.0534358 
## 7  StemCells -0.14128324 -0.12055664   0.8532976 
## 8 StromalCells 0.19651865 0.18526428   0.9427313 
## 9  abTcells 0.06546412 0.08379891   1.2800739 
## 10  gdTCells 0.14511158 0.14544855   1.0023222 

Note that you should fix your column names so that you don't have to keep using backticks.
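A minimal sketch of such a rename, using base R only. The replacement name `mean_score` and the two-celltype stand-in data frame `df_small` are illustrative assumptions, not part of the original post:

```r
# Small stand-in for the question's data (values copied from its printout);
# check.names = FALSE keeps the non-syntactic column name `mean(score)`
df_small <- data.frame(
  celltype = c("Bcells", "Bcells", "DendriticCells", "DendriticCells"),
  sample = rep(c("SP ID control", "SP ID treated"), times = 2),
  `mean(score)` = c(0.16095354, 0.15574347, 0.10478805, 0.12524704),
  check.names = FALSE
)

# Rename the column; `mean_score` is a hypothetical replacement name
names(df_small)[names(df_small) == "mean(score)"] <- "mean_score"

# The column can now be referenced without backticks
df_small$mean_score
```

The same line works on the full `df` from the question.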

To get exactly the desired result, gather back to long form, filter to keep only the treated rows, and select the desired columns:

df %>% spread(sample, `mean(score)`) %>% 
    mutate(pairwise_division = `SP ID treated`/`SP ID control`) %>% 
    gather(sample, `mean(score)`, starts_with('SP')) %>% 
    filter(sample == 'SP ID treated') %>% 
    select(celltype, sample, pairwise_division) 

##   celltype  sample pairwise_division 
## 1   Bcells SP ID treated   0.9676300 
## 2 DendriticCells SP ID treated   1.1952416 
## 3  Macrophages SP ID treated   0.8434003 
## 4  Monocytes SP ID treated   1.0815668 
## 5   NKCells SP ID treated   1.3859746 
## 6  Neutrophils SP ID treated   1.0534358 
## 7  StemCells SP ID treated   0.8532976 
## 8 StromalCells SP ID treated   0.9427313 
## 9  abTcells SP ID treated   1.2800739 
## 10  gdTCells SP ID treated   1.0023222 
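In tidyr 1.0.0 and later, `spread()` and `gather()` are superseded by `pivot_wider()` and `pivot_longer()`. The following is a sketch of the same idea with `pivot_wider()`, not the original answer's code. It assumes the score column has been renamed to `mean_score` (a hypothetical name) and uses a small stand-in data frame `df_small`. Because the `sample` column disappears after pivoting, it can be re-added as a constant instead of gathering back to long form:

```r
library(tidyr)
library(dplyr)

# Small stand-in for the question's data (values copied from its printout)
df_small <- data.frame(
  celltype = c("Bcells", "Bcells", "DendriticCells", "DendriticCells"),
  sample = rep(c("SP ID control", "SP ID treated"), times = 2),
  mean_score = c(0.16095354, 0.15574347, 0.10478805, 0.12524704)
)

res <- df_small %>%
  # One row per cell type, one column per sample label
  pivot_wider(names_from = sample, values_from = mean_score) %>%
  # Treated divided by control
  mutate(pairwise_division = `SP ID treated` / `SP ID control`,
         sample = "SP ID treated") %>%
  select(celltype, sample, pairwise_division)

res
```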

Equivalent versions are possible in base R or with data.table, if you prefer. Or take the direct route:

aggregate(cbind(pairwise_division = `mean(score)`) ~ celltype, 
      df[order(df$celltype, df$sample), ], 
      FUN = function(x){x[2]/x[1]}) 

##   celltype pairwise_division 
## 1   Bcells   0.9676300 
## 2 DendriticCells   1.1952416 
## 3  Macrophages   0.8434003 
## 4  Monocytes   1.0815668 
## 5   NKCells   1.3859746 
## 6  Neutrophils   1.0534358 
## 7  StemCells   0.8532976 
## 8 StromalCells   0.9427313 
## 9  abTcells   1.2800739 
## 10  gdTCells   1.0023222 
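The `aggregate()` call above relies on row order: within each cell type, the control row must sort before the treated row for `x[2]/x[1]` to mean treated/control. As a sketch of an order-independent alternative in base R, the rows can be matched by label with `merge()`. The column name `mean_score` and the stand-in data frame `df_small` (deliberately listed out of order) are assumptions for illustration:

```r
# Stand-in data with the Bcells rows in "wrong" order on purpose
df_small <- data.frame(
  celltype = c("Bcells", "Bcells", "DendriticCells", "DendriticCells"),
  sample = c("SP ID treated", "SP ID control", "SP ID control", "SP ID treated"),
  mean_score = c(0.15574347, 0.16095354, 0.10478805, 0.12524704)
)

# Split by sample label, keeping only the key and the score
treated <- df_small[df_small$sample == "SP ID treated", c("celltype", "mean_score")]
control <- df_small[df_small$sample == "SP ID control", c("celltype", "mean_score")]

# Join on cell type; the suffixes disambiguate the two score columns
paired <- merge(treated, control, by = "celltype",
                suffixes = c("_treated", "_control"))
paired$pairwise_division <- paired$mean_score_treated / paired$mean_score_control

paired[, c("celltype", "pairwise_division")]
```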
+0

Thanks, but how come the value in the first row of your result is not `0.967630031`? – neversaint

+2

Oops, I divided the wrong way round and posted the wrong version. Fixed. – alistaire

5

If your data is ordered and perfectly paired:

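# Rows 2, 4, 6, ... hold the treated sample of each control/treated pair
pair_index <- seq(2, nrow(df), by = 2)
# Divide each treated score (column 3) by the control score in the row above
df[pair_index, 'pairwise_division'] <- df[pair_index, 3] / df[pair_index - 1, 3]
df[pair_index, c(1, 2, 4)]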
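Since this approach depends on the rows being ordered and perfectly paired, it may be worth checking that assumption first. A minimal sketch of such a check, using a stand-in data frame `df_small` and a hypothetical flag name `is_paired`:

```r
# Small stand-in for the question's data (values copied from its printout)
df_small <- data.frame(
  celltype = c("Bcells", "Bcells", "DendriticCells", "DendriticCells"),
  sample = rep(c("SP ID control", "SP ID treated"), times = 2),
  mean_score = c(0.16095354, 0.15574347, 0.10478805, 0.12524704)
)

even <- seq(2, nrow(df_small), by = 2)

# TRUE only if rows come in control/treated pairs sharing one cell type
is_paired <- nrow(df_small) %% 2 == 0 &&
  all(df_small$celltype[even] == df_small$celltype[even - 1]) &&
  all(df_small$sample[even] == "SP ID treated") &&
  all(df_small$sample[even - 1] == "SP ID control")

is_paired
```

If the check fails, sorting with `df[order(df$celltype, df$sample), ]` (as in the `aggregate()` answer) restores the expected order, provided every cell type has exactly one control and one treated row.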