分割FASTA文件和第一線的基礎

我有以下內容一個巨大的文件上的重命名：分割FASTA文件和第一線的基礎

文件名：input.txt中

>chr1 
jdlfnhl 
dh,ndh 
dnh. 

dhjl 

>chr2 
dhfl 
dhl 
dh;l 

>chr3 

shgl 
sgl 

>chr2_random 
dgld

我需要這個文件以這樣的方式分割我得到四個單獨的文件，如下：

文件1：chr1.fa

>chr1 
jdlfnhl 
dh,ndh 
dnh. 

dhjl

文件2：chr2.fa

>chr2 
dhfl 
dhl 
dh;l

文件3：chr3.fa

>chr3 

shgl 
sgl

文件4：chr2_random.fa

>chr2_random 
dgld

我試圖在Linux csplit可，但不能被文本後，立即將其重命名「 >」。

csplit -z input.txt '/>/' '{*}'

來源

2012-08-05 learner

既然你表示你在一個Linux機器上，'awk'似乎是正確的工具。

用法：
./foo.awk your_input_file

foo.awk：

#!/usr/bin/awk -f 

/^>chr/ { 
    OUT=substr($0,2) ".fa" 
} 

OUT { 
    print >OUT 
}

你能做到這一點也一行：

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input

來源

2012-08-05 17:38:53

在哪裏添加「.fa」擴展 – learner 2012-08-05 17:45:57

我忘了 - 現在修復它 – 2012-08-05 17:46:37

還有一點：此解決方案保留這是一個問題嗎？ – 2012-08-05 17:49:47

with open('data.txt') as f: 
    lines=f.read() 
    lines=lines.split('>') 
    lines=['>'+x for x in lines[1:]] 
    for x in lines: 
     file_name=x.split('\n')[0][1:] #use this variable to create the new file 
     fil=open(file_name+'.fa','w') 
     fil.write(x) 
     fil.close()

來源

2012-08-05 17:32:16

這個問題提到了一個「龐大」的文件，有可能加載整個事情就可能會使用太多的內存...也許沒有，所以+1 – dbr 2012-08-05 17:42:13

如果你特別希望與Python來試試這個，你可以使用此代碼

f2 = open("/dev/null", "r") 
f = open("input.txt", "r") 
for line in f: 
    if ">" in line: 
     f2.close() 
     f2 = open(line.split(">")[1]),"w") 
    else: 
     f2.write(line) 

f.close()

來源

2012-08-05 17:36:20 perilbrain

縮進丟失，因爲是'：'上'else' – dbr 2012-08-05 17:40:00

雅我試過編輯，但它獲取請求超時，不知道什麼是錯誤的歌劇:( – perilbrain 2012-08-05 17:43:39

略顯凌亂的劇本，但應該在一個大的文件工作，因爲它只能讀取的一行一時間

要運行，你做python thescript.py input.txt（或者它會從標準輸入讀取，像cat input.txt | python thescript.py）

import sys 
import fileinput 

in_file = False 

for line in fileinput.input(): 
    if line.startswith(">"): 
     # Close current file 
     if in_file: 
      f.close() 

     # Make new filename 
     fname = line.rstrip().partition(">")[2] 
     fname = "%s.fa" % fname 

     # Open new file 
     f = open(fname, "w") 
     in_file = True 

     # Write current line 
     f.write(line) 

    elif in_file: 
     # Write line to currently open file 
     f.write(line) 

    else: 
     # Something went wrong, no ">chr1" found yet 
     print >>sys.stderr, "Line %r encountered, but no preceeding > line found"

來源

2012-08-05 17:38:22 dbr

如果你發現自己想用FASTA/FAST做更復雜的事情Q文件，你應該考慮Biopython。

這裏有一個職位有關修改和重寫FASTQ文件：http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

而另約分裂FASTA文件：http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html

來源

2012-08-05 18:27:25 Lenna

你最好的選擇是從exonerate suite使用fastaexplode程序：

$ fastaexplode -h 
fastaexplode from exonerate version 2.2.0 
Using glib version 2.30.2 
Built on Jan 12 2012 
Branch: unnamed branch 

fastaexplode: Split a fasta file up into individual sequences 
Guy St.C. Slater. [email protected] 2000-2003. 

Synopsis: 
-------- 
fastaexplode <path> 

General Options: 
--------------- 
-h --shorthelp [FALSE] <TRUE> 
    --help [FALSE] 
-v --version [FALSE] 

Sequence Input Options: 
---------------------- 
-f --fasta [mandatory] <*** not set ***> 
-d --directory [.] 

--

來源

2012-08-14 02:18:28 terdon

或者，BioPython可能已被使用。在virtualenv中安裝它很簡單：

virtualenv biopython_env 
source biopython_env/bin/activate 
pip install numpy 
pip install biopython

而且一旦完成，拆分fasta文件很容易。讓我們假設你有路徑的fasta_file變量FASTA文件：

from Bio import SeqIO 

parser = SeqIO.parse(fasta_file, "fasta") 
for entry in parser: 
    SeqIO.write(entry, "chr{}.fa".format(entry.id), "fasta")

需要注意的是這個版本格式的Python2.7的作品，但它可能無法正常工作在舊版本。

至於性能，我用這個在可忽略不計的時間將人類基因組引用從1000基因組項目中分離出來，但我不知道它如何適用於較大的文件。

來源

2013-08-22 18:56:41 legaultmarc

'str.format（）'在2.6 [docs]中引入（http://docs.python.org/library/stdtypes.html ＃str.format） – Lenna 2014-01-09 17:23:15

#!/usr/bin/perl-w 
use strict; 
use warnings; 


my %hash =(); 
my $key = ''; 
open F, "input.txt", or die $!; 
while(<F>){ 
    chomp; 
    if($_ =~ /^(>.+)/){ 
     $key = $1; 
    }else{ 
     push @{$hash{$key}}, $_ ; 
    } 
} 

foreach(keys %hash){ 
    my $key1 = $_; 
    my $key2 =''; 
    if($key1 =~ /^>(.+)/){ 
     $key2 = $1; 
    } 
    open MYOUTPUT, ">","$key2.fa", or die $!; 
    print MYOUTPUT join("\n",$_,@{$hash{$_}}),"\n"; 
    close MYOUTPUT; 
}

來源

2014-01-02 18:23:07 user3138373

分割FASTA文件和第一線的基礎

回答

相關問題