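Efficient pairwise correlation for two matrices of features

In Python I need to find the pairwise correlation between all features in a matrix A and all features in a matrix B. In particular, I am interested in finding the strongest Pearson correlation that a given feature in A has across all features in B. I do not care whether the strongest correlation is positive or negative.

I've done an inefficient implementation using two loops and scipy below. However, I'd like to use np.corrcoef or another similar method to compute it efficiently. Matrix A has shape 40000x400 and B has shape 40000x1440. My attempt at doing it efficiently can be seen below as the method find_max_absolute_corr(A,B). However, it fails with the following error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
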
import numpy as np
from scipy.stats import pearsonr


def find_max_absolute_corr(A, B):
    """ Finds for each feature in `A` the highest Pearson
        correlation across all features in `B`. """
    max_corr_A = np.zeros((A.shape[1]))

    for A_col in range(A.shape[1]):
        print("Calculating {}/{}.".format(A_col+1, A.shape[1]))

        metric = A[:,A_col]
        pearson = np.corrcoef(B, metric, rowvar=0)

        # takes negative correlations into account as well
        min_p = min(pearson)
        max_p = max(pearson)
        max_corr_A[A_col] = max_absolute(min_p, max_p)

    return max_corr_A


def max_absolute(min_p, max_p):
    # conditional escape: a NaN correlation is treated as an error
    if np.isnan(min_p) or np.isnan(max_p):
        raise ValueError("NaN correlation.")
    # return whichever value has the larger magnitude, keeping its sign
    if abs(max_p) > abs(min_p):
        return max_p
    else:
        return min_p


if __name__ == '__main__':

    A = np.array(
        [[10, 8.04, 9.14, 7.46],
         [8, 6.95, 8.14, 6.77],
         [13, 7.58, 8.74, 12.74],
         [9, 8.81, 8.77, 7.11],
         [11, 8.33, 9.26, 7.81]])

    B = np.array(
        [[-14, -9.96, 8.10, 8.84, 8, 7.04],
         [-6, -7.24, 6.13, 6.08, 5, 5.25],
         [-4, -4.26, 3.10, 5.39, 8, 5.56],
         [-12, -10.84, 9.13, 8.15, 5, 7.91],
         [-7, -4.82, 7.26, 6.42, 8, 6.89]])

    # simple, inefficient method
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr,_ = pearsonr(A[:,A_col], B[:,B_col])
            high_corr = max_absolute(high_corr, corr)
        print(high_corr)

    # -0.161314601631
    # 0.956781516149
    # 0.621071009239
    # -0.421539304112

    # efficient method
    max_corr_A = find_max_absolute_corr(A, B)
    print(max_corr_A)

    # [-0.161314601631,
    # 0.956781516149,
    # 0.621071009239,
    # -0.421539304112]

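
For reference, the quantity being asked for can be written without explicit loops. The following is only a minimal sketch under stated assumptions, not the method from the post: the function name max_abs_corr is hypothetical, and it assumes that no column of A or B is constant (a constant column has zero standard deviation and would produce NaN). It standardizes every column and obtains all pairwise Pearson correlations from a single matrix product.

```python
import numpy as np


def max_abs_corr(A, B):
    """For each column of A, return the signed Pearson correlation with the
    column of B whose correlation has the largest absolute value.

    Hypothetical helper for illustration; assumes no constant columns.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n = A.shape[0]

    # Standardize each column (population standard deviation, ddof=0).
    A_z = (A - A.mean(axis=0)) / A.std(axis=0)
    B_z = (B - B.mean(axis=0)) / B.std(axis=0)

    # corr[i, j] is the Pearson correlation of A[:, i] with B[:, j].
    corr = A_z.T.dot(B_z) / n

    # For each A column, pick the B column with the largest |correlation|
    # and return the signed value.
    best = np.argmax(np.abs(corr), axis=1)
    return corr[np.arange(corr.shape[0]), best]
```

With the shapes mentioned in the question, the intermediate correlation matrix would be 400x1440, so the cost is dominated by one matrix multiplication rather than 400 x 1440 separate calls.
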
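You would get 'corr' as 'NaN' for B[:,5], because all of its values are '8s'. So when you call 'max_absolute', it would give 'NaN' as output instead of the actual 'max'. You would probably see this if you print 'corr' after each run of 'pearsonr'. So shouldn't you conditionally escape that case? – Divakar
Thanks for spotting that. It will not be a problem in my real data set, but I have added the conditional escape. I have also changed the example, as this should not be the focus. – pir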
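
The NaN case raised in the comments can be reproduced in a few lines. This is only an illustrative sketch: the arrays constant, other and corrs are made-up values, and skipping NaN entries with np.nanargmax is one possible way to handle the case, not necessarily the conditional escape that was added to the question.

```python
import numpy as np

# Illustrative data: one constant column and one varying column.
constant = np.array([8.0, 8.0, 8.0, 8.0, 8.0])
other = np.array([10.0, 8.0, 13.0, 9.0, 11.0])

# A constant column has zero variance, so its correlation is 0/0, i.e. NaN.
# errstate only silences the floating-point warning NumPy would emit.
with np.errstate(invalid="ignore", divide="ignore"):
    corr = np.corrcoef(constant, other)[0, 1]
print(np.isnan(corr))

# One way to skip NaN entries when picking the strongest correlation:
# nanargmax ignores NaN when locating the largest absolute value.
corrs = np.array([0.2, np.nan, -0.7])
strongest = corrs[np.nanargmax(np.abs(corrs))]
print(strongest)
```
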