將公式應用到數據行，它跨越多行

我剛剛開始嘗試學習一些Python的第一步。目前正在通過旨在教授生物信息學python技能的Rosalind在線課程。（非常好，請參閱：rosalind.info）將公式應用到數據行，它跨越多行

我正在努力解決一個特定的問題。我在FASTA格式的文件，其具有形式，因此：

>Sequence_Header_1 
ACGTACGTACGTACGTACGT 
ACGTACGTACGTACGTACGT 
>Sequence_Header_2 
ACGTACGTACGTACGTACGT 
ACGTACGTACGTACGTACGT

我需要計算G和C的百分比（不包括報頭）的文件的各條目，並返回該數目，例如：

>Sequence_Header_1 
48.75% 
>Sequence_header_2 
52.43%

到目前爲止我的代碼是：

file = open("input.txt" , "r") 
for line in file: 
    if line.startswith(">"): 
     print(line.rstrip())   
    else: 
     print ('%3.2f' % (line.count('G')+line.count('C')/len(line)*100)) 
file.close()

這是做幾乎什麼，我需要做的。我只是遇到了序列數據跨越多行的麻煩。目前，我得到的文件中的每一行的％GC含量，而不是爲每個條目返回一個單一的數字，例如：

>Sequence_Header_1 
48.75% 
52.65% 
>Sequence_header_2 
52.43% 
50.25%

我如何運用我的公式來橫跨多行的數據？

由於提前，

來源

2013-07-10 mu0u506a

我認爲這是你在找什麼：

# Read the input file 
with open("input.txt", "r") as f: 
    s = f.read() 

# Loop through the Sequences 
for i, b in enumerate(s.split("Sequence_Header_")): 
    if not b: continue # At least the first one is empty 
         # because there is no data before the first sequence 
    # Join the lines with data 
    data = "".join(b.split("\n")[1:]) 

    # Print the result 
    print("Sequence_Header_{i}\n{p:.2f}%".format(
     i=i, p=(data.count('G')+data.count('C'))/len(data)*100))

注：我無法找到你的榜樣的「>」符號。如果您的頭文件以>開頭，那麼您可以將代碼重新寫入s.split（「>」），代碼仍然可以正常工作。

來源

2013-07-10 14:43:10 joente

嗨喬恩特，我的錯誤，標題確實應該以「>」開頭。感謝您的輸入。那裏有很多東西對我來說是全新的。試着讓我的simple-noob版本開始工作！ – mu0u506a

嘗試保持運行計數，然後在找到新標題時重新設置該計數器。

count = 0.0 
line_length=0.0 
seen_header = False 
for line in open("input.txt" , "r"): #saves memory. 
    if line.startswith('>'): 
     if not seen_header: 
      header = line.strip() 
      seen_header=True 
     if line_length > 0.0: 
      print header,'\n',('%3.2f' % (100.0*count/line_length)) 
      count = 0.0 
      line_length=0.0 
      seen_header = False 
    else: 
     count += line.count('C') + line.count('C') 
     line_length += len(line.strip()) 
print header,'\n',('%3.2f' % (100.0*count/line_length))

還要小心在python中的劃分，記住默認是整數除法。即5/2 = 2.您可以通過在變量中使用小數或float（）來避免這種情況。

編輯：更好，也應該是line_length + = len（line.strip（）），以避免將換行符「\ n」計爲兩個字符。

來源

2013-07-10 14:48:40 seth

你好，我把你的工作方式作爲靈感，並試圖做出我自己的版本。對於爲什麼len（）函數似乎添加了幾個額外字符，直到我看到關於換行符的評論時，我感到非常困惑，所以謝謝！
計算現在似乎正常工作，但出於某種原因，我的打印語句在我的標題之前產生了一個前導「空格」字符。有什麼想法嗎？
'print（（gc_total/seq_length）* 100，'\ n'，header）' – mu0u506a

python的打印會自動在其參數之間放置空格，但這不是在這裏發生的情況，'\ n'領先的空間。我懷疑你的頭文件是由line.rstrip（）定義的，即右邊的條。嘗試使用.strip（）代替。 – seth

可能無法將整個文件保存在內存中。假設你一次不能做s = f.read()，你需要保持一個字母和總字母數的運行計數，直到一個新的序列開始。就像這樣：

file = open("input.txt" , "r") 
# for keeping count: 
char_count = 0 
char_total = 0 
for line in file: 
    if line.startswith(">"): 
     if char_total != 0: 
      # starting a new sequence, so calculate pct for last sequence 
      char_pct = (char_count/char_total) * 100 
      print ('%3.2f' % char_pct) 
      # reset the count for the next sequence 
      char_total = 0 
      char_count = 0 
     print(line.rstrip())   
    else: 
     # continuing a sequence, so add to running counts 
     char_count += line.count('G') + line.count('C') 
     char_total += len(line) 
file.close()

來源

2013-07-10 14:48:49 ChrisP

非常感謝您的幫助ChrisP。我接受了你和Seth的建議，並試圖自己編碼。現在我幾乎想到了！非常感謝大家的幫助和超級快速回復。 – mu0u506a

不是真的直接回答你的問題，但我認爲這是一個更好的方式去！如果你打算在Python中做更多的生物信息學，請看biopython。它會爲你處理fasta文件和其他常見的序列操作（以及更多！）。

所以例如：

from Bio import SeqIO 
from Bio.SeqUtils import GC 

for rec in SeqIO.parse("input.txt", "fasta"): 
    print rec.id,GC(rec.seq)

來源

2013-07-11 15:56:03 Stylize

感謝Stylize，當我做這個「真實」的東西時，我一定會給biopython一個看看。 – mu0u506a

可以解析FASTA格式，並創建與> ID作爲鍵的字典和所述序列的值，如下所示：

from collections import defaultdict 

    def parse_fasta(dataset): 
     "Parses data in FASTA format, returning a dictionary of ID's and values" 
     records = defaultdict(str) 
     record_id = None 
     for line in [l.strip() for l in dataset.splitlines()]: 
      if line.startswith('>'): 
       record_id = line[1:] 
      else: 
       records[record_id] += line 
     return records

或可以重寫這段代碼並創建一個元組/列表。我更喜歡字典，因爲它已經被編入索引。如果你仍然需要幫助，你可以在Rosalind網站上找到我。

來源

2014-02-22 01:40:57

將公式應用到數據行，它跨越多行

回答

相關問題