Different results from scipy.stats.spearmanr depending on how the data is generated

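I've run into something odd with spearmanr from scipy.stats. I'm using the values of a polynomial to get some more interesting correlations, but if I enter the values by hand (as a list, converted to a numpy array), I get a different correlation than if I compute the values with a function. The code below should demonstrate what I mean: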

import numpy as np 
from scipy.stats import spearmanr  
data = np.array([ 0.4, 1.2, 1. , 0.4, 0. , 0.4, 2.2, 6. , 12.4, 22. ]) 
axis = np.arange(0, 10, dtype=np.float64) 

print(spearmanr(axis, data))# gives a correlation of 0.693... 

# Use this polynomial 
poly = lambda x: 0.1*(x - 3.0)**3 + 0.1*(x - 1.0)**2 - x + 3.0 

data2 = poly(axis) 
print(data2) # It is the same as data 

print(spearmanr(axis, data2))# gives a correlation of 0.729...

I also noticed that the arrays differ subtly (i.e. data - data2 is not exactly zero for all elements), but the difference is tiny, on the order of 1e-16.

Is such a small difference enough to throw spearmanr off by this much?


2017-02-21 Theolodus

Is such a small difference enough to throw spearmanr off by this much?

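Yes, because Spearman's r is based on the ranks of the samples. Such tiny differences can change the ranks of values that would otherwise be equal: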

from scipy.stats import rankdata

rankdata(data)
# array([ 3., 6., 5., 3., 1., 3., 7., 8., 9., 10.])
# Note that all three values of 0.4 get the same rank 3.

rankdata(data2)
# array([ 2.5, 6. , 5. , 2.5, 1. , 4. , 7. , 8. , 9. , 10. ])
# Note that two values 0.4 get the rank 2.5 and one gets 4.
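
To make the rank dependence concrete, here is a minimal sketch (my own illustration, not part of the original code) that compares the value reported by spearmanr with the Pearson correlation computed on the average ranks. The names rho and rho_from_ranks are assumptions introduced only for this example.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

data = np.array([0.4, 1.2, 1.0, 0.4, 0.0, 0.4, 2.2, 6.0, 12.4, 22.0])
axis = np.arange(0, 10, dtype=np.float64)

# Spearman's coefficient as reported by SciPy (first element of the result).
rho = spearmanr(axis, data)[0]

# Pearson correlation computed on the average ranks of both samples.
rho_from_ranks = np.corrcoef(rankdata(axis), rankdata(data))[0, 1]

print(rho, rho_from_ranks)
```

The two numbers should agree up to floating-point rounding, so anything that changes the ranks, however small, changes the coefficient.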

If you add a small gradient to break the ties (one larger than the numerical difference you observed), you get the same result in both cases:

print(spearmanr(axis, data + np.arange(10)*1e-12)) 
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047) 

print(spearmanr(axis, data2 + np.arange(10)*1e-12)) 
# SpearmanrResult(correlation=0.74545454545454537, pvalue=0.013330146315440047)

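However, this breaks every tie, including ties that may be intentional, and can lead to over- or underestimated correlations. If the data are expected to take discrete values, numpy.round may be the preferable solution.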
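
A minimal sketch of the rounding approach, assuming that 10 decimal places is an acceptable precision for this data (that choice and the name data2_rounded are my assumptions, not something from the question):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

axis = np.arange(0, 10, dtype=np.float64)
data = np.array([0.4, 1.2, 1.0, 0.4, 0.0, 0.4, 2.2, 6.0, 12.4, 22.0])

def poly(x):
    # Same polynomial as in the question.
    return 0.1*(x - 3.0)**3 + 0.1*(x - 1.0)**2 - x + 3.0

data2 = poly(axis)

# Assumption: 10 decimals is far coarser than the ~1e-16 noise
# and far finer than the spacing between genuinely distinct values.
data2_rounded = np.round(data2, 10)

# The three values near 0.4 are tied again after rounding.
print(rankdata(data2_rounded))
print(spearmanr(axis, data2_rounded)[0], spearmanr(axis, data)[0])
```

Unlike the gradient trick, rounding keeps the ties, so the computed values should give the same correlation as the hand-entered ones. The number of decimals has to be chosen to suit the data at hand.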


2017-02-21 15:41:28 kazemakase
