Is pandas Spearman correlation behaving strangely?

pandas version 0.18.1

from pandas import Series 
a = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Crystal Palace'] 
b = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United'] 
c = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United'] 
d = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'West Ham'] 


Series(a).corr(Series(b), method="spearman") 
0.69999999999999996 
Series(c).corr(Series(d), method="spearman") 
0.8999999999999998

Source

2017-01-20 Tales Tenorio Pimentel

Python 3.5.2 and Anaconda 4.4.1

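pandas has to rank these strings somehow, so it orders them alphabetically. As a result, a team's rank can change depending on which other teams are present in the list. So pandas is computing the result "correctly", but this is not the operation you want.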
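To illustrate the point in this comment, here is a minimal sketch of alphabetical ranking in plain Python. The helper `alphabetical_ranks` is a made-up name for illustration, not a pandas function, and it assumes the list has no duplicate entries.

```python
def alphabetical_ranks(items):
    """Return the 1-based alphabetical rank of each string (assumes no duplicates)."""
    # Hypothetical helper: position of each name in the sorted list.
    order = {name: i + 1 for i, name in enumerate(sorted(items))}
    return [order[name] for name in items]

a = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Crystal Palace']
b = ['Arsenal', 'Leicester', 'Man City', 'Tottenham', 'Man United']

ranks_a = alphabetical_ranks(a)
ranks_b = alphabetical_ranks(b)
print(ranks_a)  # [1, 3, 4, 5, 2]
print(ranks_b)  # [1, 2, 3, 5, 4]
```

Only the last team differs between the two lists, yet 'Leicester' is ranked 3 in the first list and 2 in the second, because 'Crystal Palace' sorts ahead of it while 'Man United' does not.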

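I'm not a statistician, but doesn't a correlation have to be computed on two numeric series? What do you expect as output? On pandas 0.19.2 the example code above crashes because the strings are not floats. – nico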

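This is the expected behavior. Spearman correlation is a rank correlation, meaning it is computed on the ranks of your data, not on the data itself. In your example the data may differ in only one position, but that single difference changes the ranks of the other entries. As suggested in the comments, Spearman correlation may not be what you actually want to use.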

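To expand further: under the hood, pandas essentially calls scipy.stats.spearmanr to compute the correlation. Looking at the source code for spearmanr, it essentially ends up ranking the data with scipy.stats.rankdata and then calling np.corrcoef to get the correlation: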

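import numpy as np
import scipy.stats as ss
corr1 = np.corrcoef(ss.rankdata(a), ss.rankdata(b))[1, 0]  # ranks of a vs. ranks of b
corr2 = np.corrcoef(ss.rankdata(c), ss.rankdata(d))[1, 0]  # ranks of c vs. ranks of d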

This produces the same values you are seeing. Now look at the ranks used in each correlation calculation:

ss.rankdata(a) 
[ 1. 3. 4. 5. 2.] 

ss.rankdata(b) 
[ 1. 2. 3. 5. 4.] 

ss.rankdata(c) 
[ 1. 2. 3. 5. 4.] 

ss.rankdata(d) 
[ 1. 2. 3. 4. 5.]

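Note that the ranks of a and b differ in three positions, whereas the ranks of c and d differ in only two, so we should expect the resulting correlations to differ.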
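As a quick check on those numbers, the sketch below applies the textbook Spearman formula for untied ranks, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), directly to the ranks listed above. The function name `spearman_from_ranks` is my own for illustration, and the formula only holds when there are no ties; it is not how scipy computes the value internally.

```python
def spearman_from_ranks(rx, ry):
    """Spearman's rho from two rank lists of equal length (assumes no tied ranks)."""
    n = len(rx)
    # Sum of squared rank differences, position by position.
    d2 = sum((x - y) ** 2 for x, y in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ranks copied from the rankdata output above.
rho_ab = spearman_from_ranks([1, 3, 4, 5, 2], [1, 2, 3, 5, 4])
rho_cd = spearman_from_ranks([1, 2, 3, 5, 4], [1, 2, 3, 4, 5])
print(round(rho_ab, 6), round(rho_cd, 6))  # 0.7 0.9
```

For a and b the squared differences sum to 6, giving 1 - 36/120 = 0.7; for c and d they sum to 2, giving 1 - 12/120 = 0.9, which matches the values in the question.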

Source

2017-01-20 21:51:15 root
