2014-09-23 14 views
1

背景:在MEME file formatMEME文件格式序列PWM的熵?

我有DNA基序表示爲position-weight-matrices (PWMs)。以下是取自Here的示例(有關詳細說明,另請參閱Here)。

MOTIF c585_n005_AGCWTAATC_d_0.034 

letter-probability matrix: alength= 4 w= 10 nsites= 1000 E= 0.05 
0.4000 0.2000 0.2000 0.2000 
0.0000 0.0000 1.0000 0.0000 
1.0000 0.0000 0.0000 0.0000 
0.0000 0.0000 0.0000 1.0000 
0.0000 0.0000 0.1130 0.8870 
1.0000 0.0000 0.0000 0.0000 
0.9217 0.0000 0.0000 0.0783 
0.0000 0.0000 1.0000 0.0000 
0.0000 1.0000 0.0000 0.0000 
0.0500 0.0500 0.0500 0.8500 

MOTIF c101_n004_ATCARTACA_d_0.042 

letter-probability matrix: alength= 4 w= 9 nsites= 1000 E= 0.05 

1.0000 0.0000 0.0000 0.0000 
0.0000 0.0000 0.0402 0.9598 
0.0000 1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 0.0000 
0.2696 0.0000 0.7304 0.0000 
0.0000 0.0000 0.0000 1.0000 
1.0000 0.0000 0.0000 0.0000 
0.0000 1.0000 0.0000 0.0000 
1.0000 0.0000 0.0000 0.0000 

問:

用一個簡單的腳本,我怎麼能計算Shannon entropy每個主題?

我搜索了我熟悉的語言(Python,Perl,MATLAB)的腳本,但未能發現我能理解。對於如何以一種簡單的方式或使用Biopython/Bioperl/MATLAB-toolbox中提供的工具以編程方式提供建議將不勝感激。

回答

2

試試這個:

def entropy(X, Y): 
    import numpy as np 
    probs = [] 
    for c1 in set(X): 
     for c2 in set(Y): 
      probs.append(np.mean(np.logical_and(X == c1, Y == c2))) 
    return np.sum(-p * np.log2(p) for p in probs) 

def calcShannonEnt(dataSet): 
    import math 
    numEntries = len(dataSet) 
    labelCounts = {} 
    for featVec in dataSet: #the the number of unique elements and their occurance 
     currentLabel = featVec[-1] 
     if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0 
     labelCounts[currentLabel] += 1 
    shannonEnt = 0.0 
    for key in labelCounts: 
     prob = float(labelCounts[key])/numEntries 
     shannonEnt -= prob * math.log(prob,3) 
    return shannonEnt 

dataSet = [[0, 1, 1, 'a'], 
    [0, 1, 1, 'b'], 
    [0, 1, 1, 'c'], 
    [0, 1, 1, 'c'], 
    [1, 0, 1, 'c']] 

a = calcShannonEnt(dataSet) 

from cogent import LoadSeqs 
from cogent.parse.meme import MemeParser 

print a