2016-11-18
3

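Converting a dataset of strings into a matrix

I have a tab-delimited dataset, and I would like to convert the following sequences into a character matrix, with one character per cell: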

CATGGGGAAAACTGA 
CCTCTCGATCACCGA 
CCTATAGATCACCGA 
CCGATTGATCACCGA 
CCTTGTGCAGACCGA 

I used:

rbind(strsplit("CATGGGGAAAACTGA", "")[[1]],
      strsplit("CCTCTCGATCACCGA", "")[[1]],
      strsplit("CCTATAGATCACCGA", "")[[1]],
      strsplit("CCGATTGATCACCGA", "")[[1]],
      strsplit("CCTTGTGCAGACCGA", "")[[1]])

And this produces:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] "C"  "A"  "T"  "G"  "G"  "G"  "G"  "A"  "A"  "A"   "A"   "C"   "T"   "G"   "A"
[2,] "C"  "C"  "T"  "C"  "T"  "C"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"
[3,] "C"  "C"  "T"  "A"  "T"  "A"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"
[4,] "C"  "C"  "G"  "A"  "T"  "T"  "G"  "A"  "T"  "C"   "A"   "C"   "C"   "G"   "A"
[5,] "C"  "C"  "T"  "T"  "G"  "T"  "G"  "C"  "A"  "G"   "A"   "C"   "C"   "G"   "A"

However, typing out one strsplit call per sequence is tedious when the dataset is very large. How can I do this automatically?

+1

Use 'do.call': something like 'do.call("rbind", lapply(myDNAVec, strsplit, split = ""))'. – lmo

+0

Is the sequence length fixed, always 15? – zx8754

+2

@lmo There is no need for 'lapply'. 'strsplit(myDNAvec, split = "")' will work, because strsplit is already vectorised over its input. –
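Combining the two comments above, here is a minimal sketch of the do.call approach. The vector name seqs is only illustrative (the comments call it myDNAvec), and the sketch assumes every sequence has the same length, as in the question. rbind would otherwise recycle the shorter rows.

```r
# Illustrative input: the five sequences from the question.
seqs <- c("CATGGGGAAAACTGA",
          "CCTCTCGATCACCGA",
          "CCTATAGATCACCGA",
          "CCGATTGATCACCGA",
          "CCTTGTGCAGACCGA")

# Guard against ragged input, which rbind would recycle.
stopifnot(length(unique(nchar(seqs))) == 1)

# strsplit is vectorised: it returns a list with one character vector per sequence.
chars <- strsplit(seqs, split = "")

# Bind the list elements together as the rows of a character matrix.
mat <- do.call(rbind, chars)

dim(mat)
```

Here mat has one row per sequence and one column per position, so mat[2, 4] is the fourth base of the second sequence.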

Answer

5

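You can use read.fwf to split each line into single characters, by treating every character as a fixed-width field of width 1: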

read.fwf(textConnection("CATGGGGAAAACTGA 
CCTCTCGATCACCGA 
CCTATAGATCACCGA 
CCGATTGATCACCGA 
CCTTGTGCAGACCGA"), rep(1, nchar("CATGGGGAAAACTGA"))) 
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 
#1 C A T G G G G A A A A C T G A 
#2 C C T C T C G A T C A C C G A 
#3 C C T A T A G A T C A C C G A 
#4 C C G A T T G A T C A C C G A 
#5 C C T T G T G C A G A C C G A 

In practice you will probably want to pass a file name rather than a text connection.
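A rough sketch of the file-based variant follows. The temporary file stands in for your real data file and is written here only to make the example self-contained. It also assumes one sequence per line and equal sequence lengths. Passing colClasses = "character" is my addition: read.fwf forwards it to read.table, which would otherwise convert a column consisting entirely of "T" to logical TRUE.

```r
# Stand-in for a real data file (hypothetical): one sequence per line.
seqs <- c("CATGGGGAAAACTGA",
          "CCTCTCGATCACCGA",
          "CCTATAGATCACCGA",
          "CCGATTGATCACCGA",
          "CCTTGTGCAGACCGA")
path <- tempfile(fileext = ".txt")
writeLines(seqs, path)

# Take the sequence length from the first line instead of hard-coding 15.
width <- nchar(readLines(path, n = 1))

# One fixed-width field per character; keep every column as character.
df <- read.fwf(path, widths = rep(1, width), colClasses = "character")

# Convert the data frame to a plain character matrix without column names.
mat <- unname(as.matrix(df))

unlink(path)  # remove the temporary file
dim(mat)
```

If the real file has further tab-separated columns, the sequence column would need to be extracted first, for example with read.delim, before splitting it.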