2015-01-10 89 views
0

我有一個包含12列的表,並希望根據第二列(sseqid)選擇第一列中的項(qseqid)。這意味着第二列(sseqid)在第11列和第12列中分別以不同的值重複,分別爲evaluebitscore。 ,我想獲得的是那些具有最低evalue最高bitscore(當evalue可相同,列的其餘部分可以被忽略,並且該數據低於下來)。Python通過使用字典比較表中的值來選擇項目

所以,我做了一個簡短的代碼,它使用第二列作爲字典的關鍵。我可以從列表qseqid + evalueqseqid + bitscore獲得第二列的五個不同項目。

下面是代碼:

#!usr/bin/python 

filename = "data.txt" 

readfile = open(filename,"r") 

d = dict() 

for i in readfile.readlines(): 
    i = i.strip() 
    i = i.split("\t") 
    d.setdefault(i[1], []).append([i[0],i[10]]) 
    d.setdefault(i[1], []).append([i[0],i[11]]) 

for x in d: 
    print(x,d[x]) 

readfile.close() 

但是,我掙扎着爬了qseqid最低的安勤和每個sseqid最高bitscore。 有沒有什麼好的邏輯來解決這個問題?

data.txt文件(包括標題行,並與»代表製表符)

qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore 
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240 
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729 
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357 
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704 
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353 
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6 
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2 
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2 
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155 
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801 
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2 
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1 
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2 
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734 
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691 
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365 
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4 
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176 
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2 
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732 

而這裏的表的內容多一點閱讀的版本:

0   1   2  3  4  5  6 7  8 9  10  11 
qseqid  sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore 
ACLA_022040 TBB  32.71 431  258  8 39 468  24 423 2.00E-76  240 
ACLA_024600 TBB  80 435  87  0  1 435  1 435   0  729 
ACLA_031860 TBB  39.74 453  251  3  1 447  1 437 1.00E-121  357 
ACLA_046030 TBB  75.81 434  105  0  1 434  1 434   0  704 
ACLA_072490 TBB  41.7 446  245  3  4 447  3 435 2.00E-120  353 
ACLA_010400 EF1A 27.31 249  127  8 69 286  9 234 3.00E-13  61.6 
ACLA_015630 EF1A  22 491  255  17 186 602  3 439 8.00E-19  78.2 
ACLA_016510 EF1A 26.23 122  61  4 21 127  9 116 2.00E-08  46.2 
ACLA_023300 EF1A 29.31 447  249  12 48 437  3 439 2.00E-45  155 
ACLA_028450 EF1A 85.55 443  63  1  1 443  1 442   0  801 
ACLA_074730 CALM 23.13 147  101  4  6 143  2 145 7.00E-08  41.2 
ACLA_096170 CALM 29.33 150  96  4 34 179  2 145 1.00E-13  55.1 
ACLA_016630 CALM  23.9 159  106  5 58 216  4 147 5.00E-12  51.2 
ACLA_031930 RPB2 36.87 1226  633  24 121 1237  26 1219   0  734 
ACLA_065630 RPB2 65.79 1257  386  14  1 1252  4 1221   0  1691 
ACLA_082370 RPB2 27.69 1228  667  37 31 1132  35 1167 7.00E-110  365 
ACLA_061960 ACT  28.57 147  95  5 146 284  69 213 3.00E-12  57.4 
ACLA_068200 ACT  28.73 463  231  13 16 471  4 374 1.00E-53  176 
ACLA_069960 ACT  24.11 141  97  4 581 718 242 375 9.00E-09  46.2 
ACLA_095800 ACT  91.73 375  31  0  1 375  1 375   0  732 
+2

什麼是預期的結果? – thefourtheye

回答

2
#!usr/bin/python 
import csv 

DATA = "data.txt" 

class Sequence: 
    def __init__(self, row): 
     self.qseqid =  row[0] 
     self.sseqid =  row[1] 
     self.pident = float(row[2]) 
     self.length = int(row[3]) 
     self.mismatch = int(row[4]) 
     self.gapopen = int(row[5]) 
     self.qstart = int(row[6]) 
     self.qend  = int(row[7]) 
     self.sstart = int(row[8]) 
     self.send  = int(row[9]) 
     self.evalue = float(row[10]) 
     self.bitscore = float(row[11]) 

    def __str__(self): 
     return (
      "{qseqid}\t" 
      "{sseqid}\t" 
      "{pident}\t" 
      "{length}\t" 
      "{mismatch}\t" 
      "{gapopen}\t" 
      "{qstart}\t" 
      "{qend}\t" 
      "{sstart}\t" 
      "{send}\t" 
      "{evalue}\t" 
      "{bitscore}" 
     ).format(**self.__dict__) 

def entries(fname, header_rows=1, dtype=list, **kwargs): 
    with open(fname) as inf: 
     incsv = csv.reader(inf, **kwargs) 

     # skip header rows 
     for i in range(header_rows): 
      next(incsv) 

     for row in incsv: 
      yield dtype(row) 

def main(): 
    bestseq = {} 
    for seq in entries(DATA, dtype=Sequence, delimiter="\t"): 
     # see if a sequence with the same sseqid already exists 
     prev = bestseq.get(seq.sseqid, None) 

     if (
      prev is None 
      or seq.evalue < prev.evalue 
      or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore) 
     ): 
      bestseq[seq.sseqid] = seq 

    # display selected sequences 
    keys = sorted(bestseq) 
    for key in keys: 
     print(bestseq[key]) 

if __name__ == "__main__": 
    main() 

導致在

ACLA_095800  ACT  91.73 375  31  0  1  375  1  375  0.0  732.0 
ACLA_096170  CALM 29.33 150  96  4  34  179  2  145  1e-13 55.1 
ACLA_028450  EF1A 85.55 443  63  1  1  443  1  442  0.0  801.0 
ACLA_065630  RPB2 65.79 1257 386  14  1  1252 4  1221 0.0  1691.0 
ACLA_024600  TBB  80.0 435  87  0  1  435  1  435  0.0  729.0 
0
filename = 'data.txt' 

readfile = open(filename,'r') 

d = dict() 
sseqid=[] 
lines=[] 
for i in readfile.readlines(): 
    sseqid.append(i.rsplit()[1]) 
    lines.append(i.rsplit()) 

sorted_sseqid = sorted(set(sseqid)) 

sdqDict={} 
key =None 

for sorted_ssqd in sorted_sseqid: 

    key=sorted_ssqd 
    evalue=[] 
    bitscore=[] 
    qseid=[] 
    for line in lines: 
     if key in line: 
      evalue.append(line[10]) 
      bitscore.append(line[11]) 
      qseid.append(line[0]) 
    sdqDict[key]=[qseid,evalue,bitscore] 

print sdqDict 

print 'TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1]) 
##I think you can do the list manipulation below to find out the qseqid 

readfile.close() 
3

既然你是一個Python新手,我很高興有幾個例子如何手動,但爲了比較,我會展示如何使用pandas庫,它使表格數據簡單得多。

由於您沒有提供示例輸出,因此我假定「每個sseqid的最低evalue和最高bitscore」指的是給定sseqid的「最低evalues中最高的bitscore」如果你想要分開,那也是微不足道的。

import pandas as pd 
df = pd.read_csv("acla1.dat", sep="\t") 
df = df.sort(["evalue", "bitscore"],ascending=[True, False]) 
df_new = df.groupby("sseqid", as_index=False).first() 

產生

>>> df_new 
    sseqid  qseqid pident length mismatch gapopen qstart qend sstart send  evalue bitscore 
0 ACT ACLA_095800 91.73  375  31  0  1 375  1 375 0.000000e+00  732.0 
1 CALM ACLA_096170 29.33  150  96  4  34 179  2 145 1.000000e-13  55.1 
2 EF1A ACLA_028450 85.55  443  63  1  1 443  1 442 0.000000e+00  801.0 
3 RPB2 ACLA_065630 65.79 1257  386  14  1 1252  4 1221 0.000000e+00 1691.0 
4 TBB ACLA_024600 80.00  435  87  0  1 435  1 435 0.000000e+00  729.0 

基本上,我們首先讀取的數據文件轉換成對象稱爲DataFrame,其是一種像Excel工作表。然後我們按evalue升序排序(以便降低evalue秒)和降低bitscore(以便更高的bitscore秒)。然後我們可以使用groupby來收集等於sseqid的數據組中的數據,並將每個組中的第一個數據作爲第一個數據,這是因爲排序會成爲我們想要的數據。

+0

這是一個非常好的解決方案,但熊貓模塊需要許多其他需要首先安裝的模塊。我可以在Linux上使用Python,但不幸的是,我目前正在使用Windows版本,堅持安裝Six模塊。無論如何,很好,謝謝。 – Karyo

2

雖然不像使用pandas庫那樣優雅和簡潔,但很有可能在不訴諸第三方模塊的情況下做你想做的事情。以下使用collections.defaultdict類來幫助創建可變長度記錄列表的字典。 AttrDict類的使用是可選的,但它使訪問每個基於字典的記錄的字段更容易,並且不像通常所需的dict['fieldname']語法那樣難看。

import csv 
from collections import defaultdict, namedtuple 
from itertools import imap 
from operator import itemgetter 

data_file_name = 'data.txt' 
DELIMITER = '\t' 
ssqeid_dict = defaultdict(list) 

# from http://stackoverflow.com/a/1144405/355230 
def multikeysort(items, columns): 
    comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else 
        (itemgetter(col.strip()),  1)) for col in columns] 
    def comparer(left, right): 
     for fn, mult in comparers: 
      result = cmp(fn(left), fn(right)) 
      if result: 
       return mult * result 
     else: 
      return 0 
    return sorted(items, cmp=comparer) 

# from http://stackoverflow.com/a/15109345/355230 
class AttrDict(dict): 
    def __init__(self, *args, **kwargs): 
     super(AttrDict, self).__init__(*args, **kwargs) 
     self.__dict__ = self 

with open(data_file_name, 'rb') as data_file: 
    reader = csv.DictReader(data_file, delimiter=DELIMITER) 
    format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames]) 
    for rec in (AttrDict(r) for r in reader): 
     # Convert the two sort fields to numeric values for proper ordering. 
     rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore)) 
     ssqeid_dict[rec.sseqid].append(rec) 

for ssqeid in sorted(ssqeid_dict): 
    # Sort each group of recs with same ssqeid. The first record after sorting 
    # will be the one sought that has the lowest evalue and highest bitscore. 
    selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0] 
    print format_spec.format(**selected) 

輸出(»代表標籤):

ACLA_095800» ACT» 91.73» 375» 31»  0»  1»  375» 1»  375» 0.0» 732.0 
ACLA_096170» CALM» 29.33» 150» 96»  4»  34»  179» 2»  145» 1e-13» 55.1 
ACLA_028450» EF1A» 85.55» 443» 63»  1»  1»  443» 1»  442» 0.0» 801.0 
ACLA_065630» RPB2» 65.79» 1257» 386» 14»  1»  1252» 4»  1221» 0.0» 1691.0 
ACLA_024600» TBB» 80»  435» 87»  0»  1»  435» 1»  435» 0.0» 729.0 
+0

這很好,但輸出不顯示qsequid。謝謝你解決我的問題。 – Karyo

+0

謝謝。幾乎可笑的是,在你重新格式化你的問題之後,我是那個不瞭解你想要的結果的人。無論如何,我已經在這方面更新了我的答案,現在我明白什麼更好。 – martineau