2015-11-11 66 views
4

在Python中,我需要找到矩陣A中的所有要素與矩陣B中的所有要素之間的成對相關性。特別是,我發現在A中給定的特徵具有B中的所有特徵時,找到最強的Pearson相關性很有趣。我不關心最強的關聯是正面的還是負面的。兩個矩陣特徵的高效配對相關

我做了一個低效的實現,使用兩個循環和下面的scipy。不過,我想用np.corrcoef或其他類似的方法來有效地計算它。矩陣A具有形狀40000x400和B具有形狀40000x1440。我試圖有效地做到這一點可以在下面看到,方法find_max_absolute_corr(A,B)。但是,它失敗並出現以下錯誤:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

import numpy as np 
from scipy.stats import pearsonr 


def find_max_absolute_corr(A, B): 
    """ Finds for each feature in `A` the highest Pearson 
     correlation across all features in `B`. """ 

    max_corr_A = np.zeros((A.shape[1]))  

    for A_col in range(A.shape[1]): 
     print "Calculating {}/{}.".format(A_col+1, A.shape[1]) 

     metric = A[:,A_col] 
     pearson = np.corrcoef(B, metric, rowvar=0) 

     # takes negative correlations into account as well 
     min_p = min(pearson) 
     max_p = max(pearson) 
     max_corr_A[A_col] = max_absolute(min_p, max_p) 

    return max_corr_A 


def max_absolute(min_p, max_p): 
    if np.isnan(min_p) or np.isnan(max_p): 
     raise ValueError("NaN correlation.") 
    if abs(max_p) > abs(min_p): 
     return max_p 
    else: 
     return min_p 


if __name__ == '__main__': 

    A = np.array(
     [[10, 8.04, 9.14, 7.46], 
     [8, 6.95, 8.14, 6.77], 
     [13, 7.58, 8.74, 12.74], 
     [9, 8.81, 8.77, 7.11], 
     [11, 8.33, 9.26, 7.81]]) 

    B = np.array(
     [[-14, -9.96, 8.10, 8.84, 8, 7.04], 
     [-6, -7.24, 6.13, 6.08, 5, 5.25], 
     [-4, -4.26, 3.10, 5.39, 8, 5.56], 
     [-12, -10.84, 9.13, 8.15, 5, 7.91], 
     [-7, -4.82, 7.26, 6.42, 8, 6.89]]) 

    # simple, inefficient method 
    for A_col in range(A.shape[1]): 
     high_corr = 0 
     for B_col in range(B.shape[1]): 
      corr,_ = pearsonr(A[:,A_col], B[:,B_col]) 
      high_corr = max_absolute(high_corr, corr) 
     print high_corr 

    # -0.161314601631 
    # 0.956781516149 
    # 0.621071009239 
    # -0.421539304112   

    # efficient method 
    max_corr_A = find_max_absolute_corr(A, B) 
    print max_corr_A 

    # [-0.161314601631, 
    # 0.956781516149, 
    # 0.621071009239, 
    # -0.421539304112] 
+1

因爲所有的都是'8s',所以你可以把'corr'當作'NaN'作爲B [:,5]。所以,當你做'max_absolute'時,它會給出'NaN'作爲輸出,而不是實際的'max'。如果你在每次運行'pearsonr'後打印corr',你可能會看到。那麼,你不應該有條件地逃避這種情況嗎? – Divakar

+0

感謝您發現。這在我的真實數據集中不會成爲問題,但我已經添加了條件轉義。我也改變了這個例子,因爲這不應該成爲焦點。 – pir

回答

4

似乎scipy.stats.pearsonr如下從A & B施加在列方向對Pearson相關係數式這個定義 -

enter image description here

基於該公式,可以容易地向量化作爲成對計算來自AB的列是彼此獨立的。下面是使用broadcasting一個矢量化溶液 -

# Get number of rows in either A or B 
N = B.shape[0] 

# Store columnw-wise in A and B, as they would be used at few places 
sA = A.sum(0) 
sB = B.sum(0) 

# Basically there are four parts in the formula. We would compute them one-by-one 
p1 = N*np.einsum('ij,ik->kj',A,B) 
p2 = sA*sB[:,None] 
p3 = N*((B**2).sum(0)) - (sB**2) 
p4 = N*((A**2).sum(0)) - (sA**2) 

# Finally compute Pearson Correlation Coefficient as 2D array 
pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None])) 

# Get the element corresponding to absolute argmax along the columns 
out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])] 

樣品運行 -

1)輸入:

In [12]: A 
Out[12]: 
array([[ 10. , 8.04, 9.14, 7.46], 
     [ 8. , 6.95, 8.14, 6.77], 
     [ 13. , 7.58, 8.74, 12.74], 
     [ 9. , 8.81, 8.77, 7.11], 
     [ 11. , 8.33, 9.26, 7.81]]) 

In [13]: B 
Out[13]: 
array([[-14. , -9.96, 8.1 , 8.84, 8. , 7.04], 
     [ -6. , -7.24, 6.13, 6.08, 5. , 5.25], 
     [ -4. , -4.26, 3.1 , 5.39, 8. , 5.56], 
     [-12. , -10.84, 9.13, 8.15, 5. , 7.91], 
     [ -7. , -4.82, 7.26, 6.42, 8. , 6.89]]) 

2)原始的多圈的代碼運行 -

In [14]: high_corr_out = np.zeros(A.shape[1]) 
    ...: for A_col in range(A.shape[1]): 
    ...:  high_corr = 0 
    ...:  for B_col in range(B.shape[1]): 
    ...:   corr,_ = pearsonr(A[:,A_col], B[:,B_col]) 
    ...:   high_corr = max_absolute(high_corr, corr) 
    ...:  high_corr_out[A_col] = high_corr 
    ...:  

In [15]: high_corr_out 
Out[15]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779]) 

3)提出的碼運行 -

In [16]: N = B.shape[0] 
    ...: sA = A.sum(0) 
    ...: sB = B.sum(0) 
    ...: p1 = N*np.einsum('ij,ik->kj',A,B) 
    ...: p2 = sA*sB[:,None] 
    ...: p3 = N*((B**2).sum(0)) - (sB**2) 
    ...: p4 = N*((A**2).sum(0)) - (sA**2) 
    ...: pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None])) 
    ...: out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])] 
    ...: 

In [17]: pcorr # Pearson Correlation Coefficient array 
Out[17]: 
array([[ 0.41895565, -0.5910935 , -0.40465987, 0.5818286 ], 
     [ 0.66609445, -0.41950457, 0.02450215, 0.64028344], 
     [-0.64953314, 0.65669916, 0.30836196, -0.85127779], 
     [-0.41917583, 0.59043266, 0.40364532, -0.58144102], 
     [ 0.8067843 , 0.07947386, 0.74016181, 0.53165395], 
     [-0.1613146 , 0.95678152, 0.62107101, -0.4215393 ]]) 

In [18]: out # elements corresponding to absolute argmax along columns 
Out[18]: array([ 0.8067843 , 0.95678152, 0.74016181, -0.85127779]) 

運行測試 -

In [36]: A = np.random.rand(4000,40) 

In [37]: B = np.random.rand(4000,144) 

In [38]: np.allclose(org_app(A,B),proposed_app(A,B)) 
Out[38]: True 

In [39]: %timeit org_app(A,B) # Original approach 
1 loops, best of 3: 1.35 s per loop 

In [40]: %timeit proposed_app(A,B) # Proposed vectorized approach 
10 loops, best of 3: 39.1 ms per loop 
+0

未來的用戶應該注意,如果你有任何具有相同元素的列(即方差= 0),那麼這將產生錯誤的結果。它會產生一個numpy警告。 – pir

1

添加到從個人的經驗上面的回答,

p1 = N*np.dot(B.T,A) 

相比

p1 = N*np.einsum('ij,ik->kj',A,B) 

當我的方式更快的工作這是特別當A和B是大維矩陣時爲真。

+0

您可以使用代碼格式來更好地閱讀答案。 – jogo