等位基因形式計數和刪除純合系

我有這樣的文件，等位基因形式計數和刪除純合系

bob  NULL 0 A A G G G G G 
tom  NULL 0 A A A A A A A 
sara NULL 0 C C C C T T T 
jane NULL 0 failed failed failed failed failed failed failed

我需要計數A/C，C/A，A/T，T/A，A/G，G/A，C/G，G/C，C/T，T/C，T/G，G/T，並刪除所有純合系，所以我的期望輸出看起來像這樣，

bob  NULL 0 A A G G G G G G/A 
sara NULL 0 C C C C T T T C/T

這是我的嘗試，

fileA = open("myfile.txt",'r') 
import re 
#fileA.next() 
lines=fileA.readlines() 
for line in lines: 
    new_list=re.split(r'\t+',line.strip()) 
    snp_name=new_list[0] 
    allele=new_list[3:] 
    failed_count = allele.count('failed') 
    A_count = allele.count('A') 
    C_count = allele.count('C') 
    G_count = allele.count('G') 
    T_count = allele.count('T') 
#A/C OR C/A count 
    if A_count > 0: 
    if C_count > 0: 
     if A_count > C_count: 
     new_list.append('A/C') 
     else: 
     new_list.append('C/A') 
#A/T OR T/A count 
    if T_count > 0: 
     if A_count > T_count: 
     new_list.append('A/T') 
     else: 
     new_list.append('T/A') 
#A/G OR G/A count 
    if G_count > 0: 
     if A_count > G_count: 
     new_list.append('A/G') 
     else: 
     new_list.append('G/A') 
#C/G OR G/C count 
    if C_count > 0: 
    if G_count > 0: 
     if C_count > G_count: 
     new_list.append('C/G') 
     else: 
     new_list.append('G/C') 
#C/T OR T/C count 
    if T_count > 0: 
     if C_count > T_count: 
     new_list.append('C/T') 
     else: 
     new_list.append('T/C') 
#T/G OR G/T count 
    if T_count > 0: 
    if G_count > 0: 
     if T_count > G_count: 
     new_list.append('T/G') 
     else: 
     new_list.append('G/T') 
    r=open('allele_counts.txt', 'a') 
    x='\t'.join(new_list) 
    x=x+'\n' 
    r.writelines(x) 
fileA.close() 
r.close()

你能否告訴我如何提高t他編碼並刪除所有純合子線？

來源

2015-12-01 user3224522

你爲什麼做'str（A_count）'然後'A_count>'0''？ – MaTh

是的，你是對的。我將編輯它 – user3224522

你也可以添加'elsif'而不是'if's，因爲如果你做'追加'你不需要檢查每個條件（如果我明白了嗎？） – MaTh

的問題可能來自你怎麼寫你的文件，你需要確保與實際tabs 分開的欄目代碼工作正常，我當我編輯myfile.txt 的問題是，前面的列表中，你計數'A'像：

['bob  NULL 0 A A G G G G G']

你需要的是：

['bob', 'NULL', '0', 'A', 'A', 'G', 'G', 'G', 'G', 'G']

來源

2015-12-01 12:05:50 MaTh

我檢查了我的new_list，它是一個字符串..是製表符分隔的 – user3224522

當您執行'print new_list'時，您會得到第一種還是第二種？ – MaTh

我得到第二個和程序輸出我正確的數 – user3224522

也許這個重構可以幫助：

import re 
from collections import Counter 
from operator import itemgetter 

# Use with so that you don't forget to close the file in the end. Also, it is 
# more pythonic 
with open("myfile.txt",'r') as fileA: 
    with open('allele_counts.txt', 'a') as fileB: 
     # The file object is in itself an iterator, so you can iterate over it 
     for line in fileA: 
      new_list = re.split(r'\t+',line.strip()) 
      allele = new_list[3:] 
      failed_count = allele.count('failed') 

      # Use python's counter object to do the counting 
      counts = Counter(allele) 
      # Get the top two most common strings. This returns a list of 
      # tuples with item and its count 
      top_two = counts.most_common(2) 
      # We only need the item, so pluck that out from the list 
      classification = '/'.join(map(itemgetter(0), top_two)) 

      # Add our classification to the new output list 
      new_list.append(classification) 
      # write to file 
      fileB.write('\t'.join(new_list))

來源

2015-12-01 12:21:09 zsquare

這個劇本的作品，唯一的是，我有幾個其他字母的行，但我有興趣計算這些4和。我從來沒有同一行中的ACT，它總是2個我感興趣的字母，但也有雜合子，如Y，R，K，我不想計算它們，即使它們是最常見的。希望我能解釋 – user3224522

的另一種方法是使用熊貓數據幀：

import pandas as pd 

df = pd.read_table('myfile.txt', header=None, sep=" ", skipinitialspace=True) 

select = ['A', 'G', 'C', 'T', 'failed'] 

# select out all the heterozygous rows 
for elem in select: 
    df = df[(df.iloc[:,3:10] != elem).any(axis=1)] 

# reset the index since we removed rows 
df = df.reset_index(drop=True) 
df[10] = '' # column 10 will hold the tags 

# add the tag to the line in the form A/B where count('A') > count('B') for a row 
for i in range(df.shape[0]): 
    tags = df.iloc[i, 3:10].unique().tolist() 
    if sum(df.iloc[i, 3:10] == tags[0]) < sum(df.iloc[i, 3:10] == tags[1]): 
     tags.reverse() 
    df.iloc[i, 10] = '/'.join(tags) 

df.to_csv('allele_counts.txt', sep=" ", header=False, index=False, na_rep='NULL')

當我運行它使用myfile.txt中，我得到以下allel_counts.txt：

bob NULL 0 A A G G G G G G/A 
sara NULL 0 C C C C T T T C/T

來源

2015-12-01 13:39:04

等位基因形式計數和刪除純合系

回答

相關問題