2016-10-04
-1

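Select columns from a data frame where a group of samples is non-zero

I have a data frame of samples (rows) by species (columns). A column in another data frame codes the samples into groups. I want to select every column for which all the samples in at least one of the groups have non-zero values.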

Species data frame:

structure(list(Otu000132 = c(0L, 56L, 30L, 52L, 1L, 4L, 31L, 4L, 17L, 9L, 4L), 
       Otu000144 = c(191L, 14L, 58L, 137L, 127L, 222L, 26L, 175L, 133L, 107L, 43L), 
       Otu000146 = c(0L, 0L, 0L, 0L, 16L, 62L, 41L, 16L, 60L, 32L, 0L), 
       Otu000147 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
       Otu000151 = c(2L, 9L, 4L, 1L, 0L, 4L, 4L, 2L, 3L, 0L, 0L), 
       Otu000162 = c(2L, 1L, 0L, 0L, 1L, 1L, 0L, 2L, 1L, 0L, 0L), 
       Otu000164 = c(2L, 0L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
       Otu000174 = c(0L, 0L, 3L, 1L, 0L, 2L, 0L, 1L, 2L, 1L, 0L), 
       Otu000176 = c(1L, 9L, 0L, 1L, 2L, 5L, 3L, 3L, 8L, 2L, 2L), 
       Otu000186 = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), 
       Otu000190 = c(1L, 1L, 1L, 0L, 0L, 5L, 1L, 2L, 7L, 0L, 0L)), 
      .Names = c("Otu000132", "Otu000144", "Otu000146", "Otu000147", 
        "Otu000151", "Otu000162", "Otu000164", "Otu000174", 
        "Otu000176", "Otu000186", "Otu000190"), 
      row.names = 30:40, class = "data.frame") 

Grouping data frame:

structure(c(30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
      40, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), 
      .Dim = c(11L, 2L)) 

Desired output:

structure(list(Otu000132 = c(0L, 56L, 30L, 52L, 1L, 4L, 31L, 4L, 17L, 9L, 4L), 
       Otu000144 = c(191L, 14L, 58L, 137L, 127L, 222L, 26L, 175L, 133L, 107L, 43L), 
       Otu000151 = c(2L, 9L, 4L, 1L, 0L, 4L, 4L, 2L, 3L, 0L, 0L), 
       Otu000176 = c(1L, 9L, 0L, 1L, 2L, 5L, 3L, 3L, 8L, 2L, 2L), 
       Otu000190 = c(1L, 1L, 1L, 0L, 0L, 5L, 1L, 2L, 7L, 0L, 0L)), 
      .Names = c("Otu000132", "Otu000144", "Otu000151", 
        "Otu000176", "Otu000190"), 
      row.names = 30:40, class = "data.frame") 

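I feel like this should be something I can do with dplyr select, but I can't figure it out. Does anyone have a suggestion to get me on the right path?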

+0

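This isn't very clear. Your third column is 'Otu000146', and it has four leading 0s, i.e. rows 30, 31 and 32 are all 0. Should that column be included in the desired output? Otherwise 'sp1[!Reduce('&', lapply(split(gp1[,1], gp1[,2]), function(x) {x1 <- sp1[match(x, row.names(sp1)),]; colSums(x1 == 0) > 0}))]' gives all the other columns. – akrun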

+0

My mistake, I thought it was present in all of group 2, but it isn't. – thermophile

+0

You can edit your post to change the expected output. – akrun

Answers

1

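This can indeed be done with dplyr, and in a fairly straightforward way. As others have pointed out, "Otu000146" does not meet the criteria you described and will not be included in the final column selection.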

library(dplyr) 
library(tidyr) 

df.species <- cbind(species, group = grouping[,2]) %>% # merge the grouping variable into the main data set 
    gather(variable, value, -group) %>% # gather the columns into 'long' format 
    group_by(variable, group) %>% # group by column name and group 
    summarize(keep = all(value != 0)) %>% # variables and groups where all values are non-zero 
    ungroup %>% group_by(variable) %>% # reset grouping 
    summarize(keep = any(keep)) %>% # variables where at least 1 group met the aforementioned criterion 
    dplyr::filter(keep) # final list 

    variable keep 
     <chr> <lgl> 
1 Otu000132 TRUE 
2 Otu000144 TRUE 
3 Otu000151 TRUE 
4 Otu000176 TRUE 
5 Otu000190 TRUE 

# retrieve only the matching columns 
df.desired <- species[df.species$variable] 

    Otu000132 Otu000144 Otu000151 Otu000176 Otu000190 
30   0  191   2   1   1 
31  56  14   9   9   1 
32  30  58   4   0   1 
33  52  137   1   1   0 
34   1  127   0   2   0 
35   4  222   4   5   5 
36  31  26   4   3   1 
37   4  175   2   3   2 
38  17  133   3   8   7 
39   9  107   0   2   0 
40   4  43   0   2   0 
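
For anyone who wants to see the intermediate "keep" table that the pipeline above summarises, here is a rough base R sketch of the same two-stage logic (all values non-zero within a group, then any group passing). It is not the code from this answer, just a cross-check. The data are retyped from the question, and `grp`, `per_group`, `keep` and `result` are names made up for this illustration.

```r
# Data retyped from the question; `species` and `grouping` follow the answer's naming.
species <- data.frame(
  Otu000132 = c(0, 56, 30, 52, 1, 4, 31, 4, 17, 9, 4),
  Otu000144 = c(191, 14, 58, 137, 127, 222, 26, 175, 133, 107, 43),
  Otu000146 = c(0, 0, 0, 0, 16, 62, 41, 16, 60, 32, 0),
  Otu000147 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  Otu000151 = c(2, 9, 4, 1, 0, 4, 4, 2, 3, 0, 0),
  Otu000162 = c(2, 1, 0, 0, 1, 1, 0, 2, 1, 0, 0),
  Otu000164 = c(2, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0),
  Otu000174 = c(0, 0, 3, 1, 0, 2, 0, 1, 2, 1, 0),
  Otu000176 = c(1, 9, 0, 1, 2, 5, 3, 3, 8, 2, 2),
  Otu000186 = c(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  Otu000190 = c(1, 1, 1, 0, 0, 5, 1, 2, 7, 0, 0),
  row.names = 30:40
)
grouping <- cbind(30:40, c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3))

# Look up each sample's group by its id instead of relying on row order.
grp <- grouping[match(rownames(species), grouping[, 1]), 2]

# Stage 1: one logical per column and group, TRUE when the group has no zeros.
per_group <- sapply(split(species, grp), function(g) colSums(g == 0) == 0)

# Stage 2: keep a column when at least one group passed stage 1.
keep <- rowSums(per_group) > 0
result <- species[, keep, drop = FALSE]
```

Printing `per_group` gives a columns-by-groups table of TRUE/FALSE, which makes it easy to see which group is responsible for a column being kept.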
+0

gather isn't in dplyr; you need to load tidyr as well. – thermophile

+0

You're right, thanks for the catch. – jdobres

1

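We split the first column of the grouping dataset ('gp1') by its second column (gp1[,2]) into a list and loop through that list. For each list element we subset the rows of the species dataset by matching its row names against the element, take the column sums of the logical matrix (x1==0), and check whether each sum is greater than 0, i.e. whether the column contains a zero in that group. We then compare the corresponding elements across the list with & inside Reduce, which is TRUE only for columns that have a zero in every group. Finally we negate (!) that index, changing TRUE to FALSE and vice versa, and use it to subset the columns of the dataset.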

sp1[!Reduce(`&`,lapply(split(gp1[,1], gp1[,2]), function(x) { 
       x1 <- sp1[match(x, row.names(sp1)),] 
       colSums(x1==0)>0}))] 
# Otu000132 Otu000144 Otu000151 Otu000176 Otu000190 
#30   0  191   2   1   1 
#31  56  14   9   9   1 
#32  30  58   4   0   1 
#33  52  137   1   1   0 
#34   1  127   0   2   0 
#35   4  222   4   5   5 
#36  31  26   4   3   1 
#37   4  175   2   3   2 
#38  17  133   3   8   7 
#39   9  107   0   2   0 
#40   4  43   0   2   0 
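
Since the one-liner packs several steps together, here is a sketch of the same expression unrolled into named steps. To keep it short it uses only four of the question's columns, so this is an illustrative subset and not the full data. The intermediate names (`ids_by_group`, `has_zero`, `zero_in_every_group`, `selected`) are invented for this sketch.

```r
# Four columns retyped from the question (a subset, for brevity).
sp1 <- data.frame(
  Otu000132 = c(0, 56, 30, 52, 1, 4, 31, 4, 17, 9, 4),
  Otu000146 = c(0, 0, 0, 0, 16, 62, 41, 16, 60, 32, 0),
  Otu000151 = c(2, 9, 4, 1, 0, 4, 4, 2, 3, 0, 0),
  Otu000162 = c(2, 1, 0, 0, 1, 1, 0, 2, 1, 0, 0),
  row.names = 30:40
)
gp1 <- cbind(30:40, c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3))

# Step 1: sample ids per group, as a list with one element per group.
ids_by_group <- split(gp1[, 1], gp1[, 2])

# Step 2: for each group, flag the columns that contain at least one zero.
has_zero <- lapply(ids_by_group, function(ids) {
  rows <- sp1[match(ids, row.names(sp1)), ]
  colSums(rows == 0) > 0
})

# Step 3: combine the per-group flags; TRUE means a zero in every group.
zero_in_every_group <- Reduce(`&`, has_zero)

# Step 4: negate, so that columns with at least one zero-free group are kept.
selected <- sp1[!zero_in_every_group]
```

On this subset, `names(selected)` comes out as Otu000132 and Otu000151, which agrees with the full result above.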
+0

This works, thanks. I accepted the other answer because it is hard to understand what is going on inside this expression. – thermophile

+0

@thermophile No problem. Thanks for the feedback. – akrun

0

You can do this with dplyr, or with just base functions:

species = merge(species, group, by.x=c("row.names"), by.y=c("V1")) 

#Find the lowest values in each grouping 
check = aggregate(species[,c("Otu000132", "Otu000144", "Otu000146", 
        "Otu000147", "Otu000151", "Otu000162", "Otu000164", 
        "Otu000174", "Otu000176", "Otu000186", "Otu000190")], 
        by=list(species$V2), min) 

#sum across the groupings 
vars = apply(check, 2, function(x) sum(x)) 

#retain variables where sum > 0, indicating at least one grouping has no zero observations (assumes non-negative counts)
vars = vars[vars!=0] 

#extract the variable names 
vars = names(vars)[-1] 

#subset dataset to select variables identified above 
out = species[vars] 

out 
# Otu000132 Otu000144 Otu000151 Otu000176 Otu000190 
#1   0  191   2   1   1 
#2   56  14   9   9   1 
#3   30  58   4   0   1 
#4   52  137   1   1   0 
#5   1  127   0   2   0 
#6   4  222   4   5   5 
#7   31  26   4   3   1 
#8   4  175   2   3   2 
#9   17  133   3   8   7 
#10   9  107   0   2   0 
#11   4  43   0   2   0 
+1

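You have any/all confused. This sums each group for each column, in other words it checks whether ANY value in the group is non-zero. The OP wants the groups where ALL values are non-zero, and then the columns with any group that meets that condition. – jdobres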

+0

You're right, thanks for pointing that out. I've revised the code to fix the problem. – rocket1906
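
To make the any/all distinction from the comments above concrete, here is a small sketch on a single column from the question (Otu000162). It assumes the counts are non-negative, which is what lets a per-group minimum stand in for "all non-zero". The names `otu`, `group`, `keep_by_sum` and `keep_by_min` are made up for this illustration.

```r
# Otu000162 from the question, with the group code of each sample.
otu   <- c(2, 1, 0, 0, 1, 1, 0, 2, 1, 0, 0)
group <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)

# Per-group sum is > 0 as soon as ANY value in the group is non-zero.
group_sum <- tapply(otu, group, sum)

# Per-group minimum is > 0 only when ALL values in the group are non-zero
# (valid because counts cannot be negative).
group_min <- tapply(otu, group, min)

keep_by_sum <- any(group_sum > 0)  # sum-based rule would keep the column
keep_by_min <- any(group_min > 0)  # min-based rule correctly drops it
```

Every group of Otu000162 contains a zero, so the sum-based rule keeps a column that the question's criterion rejects, while the min-based rule does not.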