2013-06-20 28 views
-3

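How can I split a FASTA file containing many protein sequences into multiple files?

I have a large FASTA file containing thousands of protein sequences, and I want to split it into multiple files.

I am using ActivePerl for my project.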

+0

Please post part of your file and explain how you want it split into multiple files. –

+0

>gi|1587000|prf||2205291A CsgA protein MWCIRLPACTPWSSTRVFCQRKAFSALMPCMRYVITGASRGIGFEFVQQLLLRGDTVEAGVRSPEGARRLEPLKQKAGNRLRIHALDVGDDDSVRAFATNVCTGPVDVLINNAGVSGLWCALGDVDYADMARTFTINALGPLRVTSAMLPGLRQGALRRVAHVTSRMGSLAANTDGGAYAYRMSKAALNMAVRSMSTDLRPEGFVTVLLHPGWVQTDMGGPDATLPAPDSVRGMLRVIDGLNPEHSGRFFDYQGTEVPW >gi|1586813|prf||2204381E ORF MGPRSIRGPGAFVFLESGAVALRAKTKTPKAEVKKAPLPFSKAVWKAVRAIPR Sir, this is a sample of the sequences in the file. Sequences like these are saved in one file, and that file has to be split into files containing about 500 sequences each. – user2503701

Answers

1

How many sequences do you want in each file?

You could do something like this:

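#!/usr/bin/perl
use strict;
use warnings;

my $fasta_file    = "something.fasta";
my $seqs_per_file = 100;    # whatever your batch size

my $file_number = 1;    # our files will be named like "something.fasta.1"
my $seq_ctr     = 0;
my $out;                # current output filehandle, undefined until the first header

open(my $fasta, '<', $fasta_file) or die "can't open $fasta_file: $!";

while (my $line = <$fasta>) {

    if ($line =~ /^>/) {

        # open a new file if we've printed enough to one file
        if ($seq_ctr++ % $seqs_per_file == 0) {
            close($out) if $out;
            my $out_name = $fasta_file . "." . $file_number++;
            open($out, '>', $out_name) or die "can't open $out_name: $!";
        }

    }

    # skip anything before the first header (e.g. leading blank lines),
    # so we never print to an unopened filehandle
    print {$out} $line if $out;

}

close($out) if $out;
close($fasta);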
+0

Sir OneSolitaryNoob, I ran your program, but it did not show any results. – user2503701

+0

The error occurs at the last line of code: print on an unopened filehandle. You can check it yourself. – user2503701

+0

You shouldn't use bareword filehandles, and you should use the 3-argument form of 'open'. – squiguy

0

You can do this easily with awk instead of Perl.

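# n = sequences per output file; output goes to your_fasta_file.1, your_fasta_file.2, ...
awk -v n=500 '/^>/ && c++ % n == 0 { if (out) close(out); out = sprintf("%s.%d", FILENAME, ++f) } out { print > out }' your_fasta_file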
+0

I don't know awk, so I'm sorry, sir. Can you help me in Perl? – user2503701

+0

Would anyone be willing to modify a Perl script that calculates the isoelectric point? – user2503701

-2

I know you said you want this in Perl, but I have done it many times in Python with BioPython, which I think is comparable to BioPerl (but better).

import sys
from Bio import SeqIO


def write_file(input_file, split_number):
    # get the base name of the fasta file (everything before the last ".")
    parent_file_base_name = input_file.rsplit(".", 1)[0]
    counter = 1

    # our first file name
    file = parent_file_base_name + "_" + str(counter) + ".fasta"

    # carries all of our records to be written
    joiner = []
    # enumerate huge fasta
    for num, record in enumerate(SeqIO.parse(input_file, "fasta"), start=1):
        # append records to our list holder
        joiner.append(">" + record.id + "\n" + str(record.seq))

        # if we have reached the maximum number of records for that file,
        # write them to the file, and then clear the record holder
        if num % split_number == 0:
            joiner.append("")
            with open(file, "w") as f:
                f.write("\n".join(joiner))

            # change file name, clear record holder, and change the file count
            counter += 1
            file = parent_file_base_name + "_" + str(counter) + ".fasta"
            joiner = []

    # write whatever is left over after the last full batch
    if joiner:
        joiner.append("")
        with open(file, "w") as f:
            f.write("\n".join(joiner))


if __name__ == "__main__":
    input_file = sys.argv[1]
    split_number = int(sys.argv[2])  # command-line arguments arrive as strings
    write_file(input_file, split_number)
    print("fasta_splitter.py is finished.")

Just run:

python script.py parent_fasta.fasta <how many records per file> 
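
If installing BioPython is not an option, the same batching idea can be sketched with the standard library alone. This is only a rough sketch under a few assumptions: every record starts with a line beginning with ">", anything before the first header can be dropped, and the names read_fasta_records and split_fasta as well as the "<input>.<n>" output naming are made up for illustration. Unlike the BioPython version, it copies header and sequence lines through unchanged instead of rebuilding them from record.id and record.seq.

```python
def read_fasta_records(handle):
    """Yield each FASTA record (header line plus its sequence lines) as one string."""
    record = []
    for line in handle:
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record:
                yield "".join(record)
            record = [line]
        elif record and line.strip():
            # Keep non-blank sequence lines; ignore text before the first header.
            record.append(line)
    if record:
        yield "".join(record)


def split_fasta(input_path, records_per_file):
    """Write input_path.1, input_path.2, ... and return the list of paths written."""
    written = []
    out = None
    try:
        with open(input_path) as handle:
            for count, record in enumerate(read_fasta_records(handle)):
                # Start a new output file every records_per_file records.
                if count % records_per_file == 0:
                    if out:
                        out.close()
                    written.append("%s.%d" % (input_path, len(written) + 1))
                    out = open(written[-1], "w")
                out.write(record)
    finally:
        if out:
            out.close()
    return written
```

Calling split_fasta("parent_fasta.fasta", 500) would then give files of 500 sequences each, with any remainder in the last file. Only one record is held in memory at a time, which may matter for very large inputs.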
0

This code is in Java. I don't mind if the moderators remove it from here, but I'm posting it in case it helps. :)

/** 
* This tool aims to chop the file in various parts based on the number of sequences required in one file. 
*/ 
package devtools.utilities; 

import java.io.FileWriter; 
import java.io.IOException; 
import java.nio.charset.StandardCharsets; 
import java.nio.file.Files; 
import java.nio.file.Paths; 
import org.apache.commons.lang3.StringUtils; 

//import java.util.List; 

/** 
* @author Arpit 
* 
*/ 
public class FileChopper { 

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");

        // Let a read error propagate; the method already declares IOException
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));

        String allLines = new String(allBytes, StandardCharsets.UTF_8);
        // Using a clever cheat with help from stackoverflow: mark the start of
        // each record with "~" so the text can be split into records.
        // This assumes "~" never occurs in the file itself.
        String cheatString = allLines.replace(">", "~>");
        String[] splitLines = StringUtils.split(cheatString, "~");
        int startIndex = 0;
        int stopIndex = 0;

        for (int j = 0; j < numOfFiles; j++) {
            if (j == (numOfFiles - 1)) {
                // the last file takes whatever records remain
                stopIndex = splitLines.length;
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            // try-with-resources closes the writer even if a write fails
            try (FileWriter fw = new FileWriter(outFileName.concat("_")
                    .concat(Integer.toString(j)).concat(".fasta"))) {
                for (int i = startIndex; i < stopIndex; i++) {
                    fw.write(splitLines[i]);
                }
            }
            startIndex = stopIndex;
        }

    }

    /** 
    * @param args 
    */ 
    public static void main(String[] args) { 
     // TODO Auto-generated method stub 
     FileChopper fc = new FileChopper(); 
     try { 
      fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta",5); 
     } catch (IOException e) { 
      // TODO Auto-generated catch block 
      e.printStackTrace(); 
     } 

    } 

} 
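
Note that this answer splits into a fixed number of files rather than a fixed number of sequences per file, and the integer division sends the whole remainder to the last file. The arithmetic can be sketched in a few lines of Python. The function names below are invented for this illustration, and balanced_sizes is an alternative the answer does not use, shown only for comparison.

```python
def floor_then_remainder_sizes(total_records, num_files):
    """Records per file as computed above: floor division, remainder in the last file."""
    base = total_records // num_files
    return [base] * (num_files - 1) + [total_records - base * (num_files - 1)]


def balanced_sizes(total_records, num_files):
    """Alternative: spread the remainder so file sizes differ by at most one record."""
    base, extra = divmod(total_records, num_files)
    return [base + 1 if i < extra else base for i in range(num_files)]


print(floor_then_remainder_sizes(407, 5))  # [81, 81, 81, 81, 83]
print(balanced_sizes(407, 5))              # [82, 82, 81, 81, 81]
```

With fewer sequences than files, the first scheme leaves every file but the last one empty (3 sequences into 5 files gives sizes 0, 0, 0, 0, 3), so checking the sequence count against numOfFiles first may be worthwhile.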