2014-12-05 91 views
2

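Split one column into two columns in a loop in R

Actually, I have the same problem as in this question: "strsplit one column with exact information into two column".

That question has already been solved, but my data looks like this: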

      SNP Geno AlleleA AlleleB AlleleC AlleleD AlleleE
1 marker1   G1      AA      AA      AA      AA      AA
2 marker2   G1      TT      TT      TT      TT      TT
3 marker3   G1      TT      TT      TT      TT      TT
4 marker1   G2      CC      CC      CC      CC      CC
5 marker2   G2      AA      AA      AA      AA      AA
6 marker3   G2      TT      TT      TT      TT      TT
7 marker1   G3      GG      GG      GG      GG      GG
8 marker2   G3      AA      AA      AA      AA      AA
9 marker3   G3      TT      TT      TT      TT      TT

dput output:

structure(list(SNP = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L), .Label = c("marker1", "marker2", "marker3"), class = "factor"), 
    Geno = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("G1", 
    "G2", "G3"), class = "factor"), AlleleA = structure(c(1L, 
    4L, 4L, 2L, 1L, 4L, 3L, 1L, 4L), .Label = c("AA", "CC", "GG", 
    "TT"), class = "factor"), AlleleB = structure(c(1L, 4L, 4L, 
    2L, 1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", 
    "CC", "GG", "TT")), AlleleC = structure(c(1L, 4L, 4L, 2L, 
    1L, 4L, 3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", 
    "GG", "TT")), AlleleD = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 
    3L, 1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", 
    "TT")), AlleleE = structure(c(1L, 4L, 4L, 2L, 1L, 4L, 3L, 
    1L, 4L), class = "factor", .Label = c("AA", "CC", "GG", "TT" 
    ))), .Names = c("SNP", "Geno", "AlleleA", "AlleleB", "AlleleC", 
"AlleleD", "AlleleE"), row.names = c(NA, -9L), class = "data.frame") 

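In that question, the asker had only one column to split into two. My problem is that I have 5000 columns (AlleleA, AlleleB, ... etc.) and want to split each of them into two columns.

I tried using a loop like this, but it doesn't work: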

for(i in colnames(dat)){ 
    dat1 <- data.frame(do.call(rbind, strsplit(as.vector(sprintf("dat$%s",i)), split = ""))) 
} 

I look forward to your guidance. Thanks.

+0

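How do you want to split the columns? (Each column into exactly two columns? How is the split defined?) tidyr has a function called separate that splits one column into several, and you could apply it to every column you want to split, for example with dplyr's mutate_each function. – 2014-12-05 09:32:06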

+0

@beginneR I have edited my question – user46543 2014-12-05 09:40:28

+0

@beginneR It works using splitstackshape :) Thanks, Ananda Mahto – user46543 2014-12-05 09:45:31

Answers

4

You can use cSplit from my "splitstackshape" package with the argument stripWhite = FALSE.

For example, if we wanted to split all of the "Allele*" columns, we would do:

library(splitstackshape) 
cSplit(mydf, grep("Allele", names(mydf)), "", stripWhite = FALSE) 
#        SNP Geno AlleleA_1 AlleleA_2 AlleleB_1 AlleleB_2 AlleleC_1
# 1: marker1   G1         A         A         A         A         A
# 2: marker2   G1         T         T         T         T         T
# 3: marker3   G1         T         T         T         T         T
# 4: marker1   G2         C         C         C         C         C
# 5: marker2   G2         A         A         A         A         A
# 6: marker3   G2         T         T         T         T         T
# 7: marker1   G3         G         G         G         G         G
# 8: marker2   G3         A         A         A         A         A
# 9: marker3   G3         T         T         T         T         T
#    AlleleC_2 AlleleD_1 AlleleD_2 AlleleE_1 AlleleE_2
# 1:         A         A         A         A         A
# 2:         T         T         T         T         T
# 3:         T         T         T         T         T
# 4:         C         C         C         C         C
# 5:         A         A         A         A         A
# 6:         T         T         T         T         T
# 7:         G         G         G         G         G
# 8:         A         A         A         A         A
# 9:         T         T         T         T         T
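
For comparison, here is a minimal base R sketch of the same split without any extra packages. It also shows why the loop in the question fails: sprintf("dat$%s", i) only builds the literal string "dat$AlleleA", so strsplit splits that string rather than the column values, and dat1 is overwritten on every iteration. The toy data and the helper name split_col are assumptions made for illustration, and the sketch assumes every genotype is exactly two characters long.

```r
# Toy data shaped like the question's (only two Allele columns, for brevity).
dat <- data.frame(
  SNP     = c("marker1", "marker2", "marker3"),
  Geno    = c("G1", "G1", "G1"),
  AlleleA = c("AA", "TT", "TT"),
  AlleleB = c("AA", "TT", "TT"),
  stringsAsFactors = TRUE
)

# Names of the columns to split.
allele_cols <- grep("^Allele", names(dat), value = TRUE)

# Hypothetical helper: split one two-character column into two columns.
split_col <- function(nm) {
  # dat[[nm]] fetches the column by name; a string such as "dat$AlleleA" does not.
  x <- as.character(dat[[nm]])
  out <- data.frame(substr(x, 1, 1), substr(x, 2, 2), stringsAsFactors = FALSE)
  names(out) <- paste(nm, 1:2, sep = "_")
  out
}

# Keep the untouched columns and bind all split pieces next to them.
res <- cbind(dat[setdiff(names(dat), allele_cols)],
             do.call(cbind, lapply(allele_cols, split_col)))
```

With 5000 columns this builds 5000 small data frames before binding them, so cSplit is likely the more convenient choice; the sketch is mainly meant to make the mechanics explicit.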
2

As @beginneR said, you can use tidyr::separate. Here is an example taken from: http://blog.rstudio.org/2014/07/22/introducing-tidyr/

head(tidier, 8) 

#>   id       trt     key    time
#> 1  1 treatment work.T1 0.08514
#> 2  2   control work.T1 0.22544
#> 3  3 treatment work.T1 0.27453
#> 4  4   control work.T1 0.27231
#> 5  1 treatment home.T1 0.61583
#> 6  2   control home.T1 0.42967
#> 7  3 treatment home.T1 0.65166
#> 8  4   control home.T1 0.56774

tidy <- tidier %>% 
    separate(key, into = c("location", "time"), sep = "\\.") 
tidy %>% head(8) 
#>   id       trt location time    time
#> 1  1 treatment     work   T1 0.08514
#> 2  2   control     work   T1 0.22544
#> 3  3 treatment     work   T1 0.27453
#> 4  4   control     work   T1 0.27231
#> 5  1 treatment     home   T1 0.61583
#> 6  2   control     home   T1 0.42967
#> 7  3 treatment     home   T1 0.65166
#> 8  4   control     home   T1 0.56774
+1

I think *this* question is more about having to do such a split on *multiple* columns. – A5C1D2H2I1M1N2O1R2T1 2014-12-05 09:40:19

+0

You're right, I didn't read the question carefully, nor @beginneR's comment. – 2014-12-05 09:41:45

+1

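Actually, I'm not sure whether this can be done with a combination of 'mutate_each' and 'separate', at least not as flexibly as in Ananda's answer, because separate requires you to specify each column you want to split. – 2014-12-05 09:55:05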

3

Another option could be

library(qdap) 
res <- colsplit2df(dat, splitcols=2:ncol(dat),sep='') 
colnames(res)[-1] <- make.names(rep(colnames(dat)[-1],each=2), unique=TRUE) 
res[1:3,1:5] 
#      SNP Geno Geno.1 AlleleA AlleleA.1
#1 marker1    G      1       A         A
#2 marker2    G      1       T         T
#3 marker3    G      1       T         T

or only for the Allele columns

colsplit2df(dat, splitcols=grep('Allele', names(dat)),sep='') 

Edit (Tyler Rinker)

I suggest first editing the column names of the data frame with setNames, as follows:

setNames(dat, gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", names(dat))) %>% 
    colsplit2df(splitcols=3:ncol(dat), sep='')
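
To see what the renaming step does on its own, the following base R sketch applies the same gsub call to a plain vector of names. The vector nms is a shortened stand-in for names(dat). As far as I understand, colsplit2df then takes the new column names from the parts on either side of "&" (its name.sep argument), but that part is not exercised here.

```r
# Shortened stand-in for names(dat).
nms <- c("SNP", "Geno", "AlleleA", "AlleleB")

# The pattern needs an uppercase letter, one or more lowercase letters,
# then another uppercase letter, so only the Allele* names match.
new_nms <- gsub("([A-Z]{1}[a-z]+[A-Z])", "\\1.1&\\1.2", nms)

print(new_nms)
# [1] "SNP"                 "Geno"                "AlleleA.1&AlleleA.2"
# [4] "AlleleB.1&AlleleB.2"
```

"SNP" has no lowercase letters and "Geno" has no trailing uppercase letter, so both are left unchanged, which is why the split in the edit starts at column 3.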