2015-12-01 60 views
0

我有這樣的文件,等位基因形式計數和刪除純合系

bob  NULL 0 A A G G G G G 
tom  NULL 0 A A A A A A A 
sara NULL 0 C C C C T T T 
jane NULL 0 failed failed failed failed failed failed failed 

我需要計數A/C,C/A,A/T,T/A,A/G,G/A,C/G,G/C,C/T,T/C,T/G,G/T,並刪除所有純合系,所以我的期望輸出看起來像這樣,

bob  NULL 0 A A G G G G G G/A 
sara NULL 0 C C C C T T T C/T 

這是我的嘗試,

fileA = open("myfile.txt",'r') 
import re 
#fileA.next() 
lines=fileA.readlines() 
for line in lines: 
    new_list=re.split(r'\t+',line.strip()) 
    snp_name=new_list[0] 
    allele=new_list[3:] 
    failed_count = allele.count('failed') 
    A_count = allele.count('A') 
    C_count = allele.count('C') 
    G_count = allele.count('G') 
    T_count = allele.count('T') 
#A/C OR C/A count 
    if A_count > 0: 
    if C_count > 0: 
     if A_count > C_count: 
     new_list.append('A/C') 
     else: 
     new_list.append('C/A') 
#A/T OR T/A count 
    if T_count > 0: 
     if A_count > T_count: 
     new_list.append('A/T') 
     else: 
     new_list.append('T/A') 
#A/G OR G/A count 
    if G_count > 0: 
     if A_count > G_count: 
     new_list.append('A/G') 
     else: 
     new_list.append('G/A') 
#C/G OR G/C count 
    if C_count > 0: 
    if G_count > 0: 
     if C_count > G_count: 
     new_list.append('C/G') 
     else: 
     new_list.append('G/C') 
#C/T OR T/C count 
    if T_count > 0: 
     if C_count > T_count: 
     new_list.append('C/T') 
     else: 
     new_list.append('T/C') 
#T/G OR G/T count 
    if T_count > 0: 
    if G_count > 0: 
     if T_count > G_count: 
     new_list.append('T/G') 
     else: 
     new_list.append('G/T') 
    r=open('allele_counts.txt', 'a') 
    x='\t'.join(new_list) 
    x=x+'\n' 
    r.writelines(x) 
fileA.close() 
r.close() 

你能否告訴我如何提高t他編碼並刪除所有純合子線?

+1

你爲什麼做'str(A_count)'然後'A_count>'0''? – MaTh

+0

是的,你是對的。我將編輯它 – user3224522

+0

你也可以添加'elsif'而不是'if's,因爲如果你做'追加'你不需要檢查每個條件(如果我明白了嗎?) – MaTh

回答

0

的問題可能來自你怎麼寫你的文件,你需要確保與實際tabs 分開的欄目代碼工作正常,我當我編輯myfile.txt 的問題是,前面的列表中,你計數'A'像:

['bob  NULL 0 A A G G G G G'] 

你需要的是:

['bob', 'NULL', '0', 'A', 'A', 'G', 'G', 'G', 'G', 'G'] 
+0

我檢查了我的new_list,它是一個字符串..是製表符分隔的 – user3224522

+0

當您執行'print new_list'時,您會得到第一種還是第二種? – MaTh

+0

我得到第二個和程序輸出我正確的數 – user3224522

0

也許這個重構可以幫助:

import re 
from collections import Counter 
from operator import itemgetter 

# Use with so that you don't forget to close the file in the end. Also, it is 
# more pythonic 
with open("myfile.txt",'r') as fileA: 
    with open('allele_counts.txt', 'a') as fileB: 
     # The file object is in itself an iterator, so you can iterate over it 
     for line in fileA: 
      new_list = re.split(r'\t+',line.strip()) 
      allele = new_list[3:] 
      failed_count = allele.count('failed') 

      # Use python's counter object to do the counting 
      counts = Counter(allele) 
      # Get the top two most common strings. This returns a list of 
      # tuples with item and its count 
      top_two = counts.most_common(2) 
      # We only need the item, so pluck that out from the list 
      classification = '/'.join(map(itemgetter(0), top_two)) 

      # Add our classification to the new output list 
      new_list.append(classification) 
      # write to file 
      fileB.write('\t'.join(new_list)) 
+0

這個劇本的作品,唯一的是,我有幾個其他字母的行,但我有興趣計算這些4和。我從來沒有同一行中的ACT,它總是2個我感興趣的字母,但也有雜合子,如Y,R,K,我不想計算它們,即使它們是最常見的。希望我能解釋 – user3224522

0

的另一種方法是使用熊貓數據幀:

import pandas as pd 

df = pd.read_table('myfile.txt', header=None, sep=" ", skipinitialspace=True) 

select = ['A', 'G', 'C', 'T', 'failed'] 

# select out all the heterozygous rows 
for elem in select: 
    df = df[(df.iloc[:,3:10] != elem).any(axis=1)] 

# reset the index since we removed rows 
df = df.reset_index(drop=True) 
df[10] = '' # column 10 will hold the tags 

# add the tag to the line in the form A/B where count('A') > count('B') for a row 
for i in range(df.shape[0]): 
    tags = df.iloc[i, 3:10].unique().tolist() 
    if sum(df.iloc[i, 3:10] == tags[0]) < sum(df.iloc[i, 3:10] == tags[1]): 
     tags.reverse() 
    df.iloc[i, 10] = '/'.join(tags) 

df.to_csv('allele_counts.txt', sep=" ", header=False, index=False, na_rep='NULL') 

當我運行它使用myfile.txt中,我得到以下allel_counts.txt:

bob NULL 0 A A G G G G G G/A 
sara NULL 0 C C C C T T T C/T