Using Python's stats.kendalltau function

I want to measure the correlation between two conference-related metrics (AcceptanceRate and FiveYrIF). I have the following two DataFrames (already ordered/ranked accordingly):

df_if:

                      Conference    FiveYrIF
0              SIGMOD Conference  112.685585
1                            KDD  103.674543
2                            CHI   99.453096
3                          SIGIR   68.967753
4                            WWW   65.715631
5                           SODA   60.151959
6                            DAC   42.076365
7                          ICCAD   39.906361
8                           CIKM   33.232224
9                           DATE   26.578906
10                       INFOCOM   22.694122
11  Winter Simulation Conference   17.448830
12                           SAC   10.646007

df_ar:

                      Conference  AcceptanceRate
0                           CIKM            15
1                          SIGIR            16
2                        INFOCOM            19.7
3                            KDD            21
4                            DAC            22
5                           DATE            23
6                            WWW            24
7                            CHI            25
8                          ICCAD            27
9              SIGMOD Conference            27
10                           SAC            29
11                          SODA            29.5
12  Winter Simulation Conference            54

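I want to compare these two metrics (FiveYrIF and AcceptanceRate) with the stats.kendalltau method. I have used it before, but with rankings of years (numbers) rather than rankings of conferences (text) as shown here.

I tried the following: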

from scipy.stats import kendalltau 

kendalltau(df_if['Conference'].values, df_ar['Conference'].values)

But it returns the following error:

TypeError: merge sort not available for item 0

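I'm not sure what I'm doing wrong. My understanding is that the comparison is purely ordinal (ordered) rather than between comparable numbers. We are comparing orderings, aren't we?

I'm trying to avoid having to go back to the database and assign some kind of numeric ID to each conference, if that is possible.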


2015-09-09 BKS

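Please add the full traceback, not just the error description. – cel

Apparently kendalltau does not handle the object arrays that Pandas uses. You can work around this by converting the array to an array of strings before passing it to kendalltau.

For example, here is a DataFrame: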

In [107]: df 
Out[107]: 
    x y 
0 aaa 0.5 
1 bb 1.4 
2 c 1.3 
3 d 2.0 
4 ee 2.1

The values in the x column are strings. Pandas represents an array of strings as an array with data type object:

In [108]: df['x'] 
Out[108]: 
0 aaa 
1  bb 
2  c 
3  d 
4  ee 
Name: x, dtype: object 

In [109]: df['x'].values 
Out[109]: array(['aaa', 'bb', 'c', 'd', 'ee'], dtype=object)
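A minimal sketch of that dtype difference, assuming NumPy and pandas are installed. The column is built with an explicit dtype=object so the result does not depend on the default string dtype of the installed pandas version; the variable names are illustrative only.

```python
import pandas as pd

# Explicit dtype=object reproduces the representation shown above,
# regardless of which string dtype the installed pandas uses by default.
labels = pd.Series(['aaa', 'bb', 'c', 'd', 'ee'], dtype=object)

obj_array = labels.values          # NumPy array holding Python str objects
str_array = obj_array.astype(str)  # fixed-width NumPy unicode array

print(obj_array.dtype)       # object
print(str_array.dtype.kind)  # U
```

The object array is the form that triggers the error in this thread; the unicode array is what astype(str) produces.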

kendalltau does not handle such an array:

In [110]: kendalltau(df['x'], df['y']) 
--------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-110-07ca97e866e2> in <module>() 
----> 1 kendalltau(df['x'], df['y']) 

/Users/warren/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in kendalltau(x, y, initial_lexsort) 
    3020  if initial_lexsort: 
    3021   # sort implemented as mergesort, worst case: O(n log(n)) 
-> 3022   perm = np.lexsort((y, x)) 
    3023  else: 
    3024   # sort implemented as quicksort, 30% faster but with worst case: O(n^2) 

TypeError: merge sort not available for item 1 

In [111]: kendalltau(df['x'].values, df['y']) 
--------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-111-e903a3b3475e> in <module>() 
----> 1 kendalltau(df['x'].values, df['y']) 

/Users/warren/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in kendalltau(x, y, initial_lexsort) 
    3020  if initial_lexsort: 
    3021   # sort implemented as mergesort, worst case: O(n log(n)) 
-> 3022   perm = np.lexsort((y, x)) 
    3023  else: 
    3024   # sort implemented as quicksort, 30% faster but with worst case: O(n^2) 

TypeError: merge sort not available for item 1

It works if you convert the array to an array of strings, using df['x'].values.astype(str):

In [112]: kendalltau(df['x'].values.astype(str), df['y']) 
Out[112]: KendalltauResult(correlation=0.79999999999999982, pvalue=0.050043527347496564)
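For the conference data in the question, one alternative that avoids sorting strings altogether is to line the two tables up by Conference and pass the numeric columns to kendalltau. This is only a sketch, under the assumption that the goal is to measure how well the two metrics agree on the ordering of the same conferences. The merged frame and variable names are illustrative, and only five rows of each table are reproduced.

```python
import pandas as pd
from scipy.stats import kendalltau

# A few rows copied from the question's df_if and df_ar tables.
df_if = pd.DataFrame({
    'Conference': ['SIGMOD Conference', 'KDD', 'CHI', 'SIGIR', 'WWW'],
    'FiveYrIF': [112.685585, 103.674543, 99.453096, 68.967753, 65.715631],
})
df_ar = pd.DataFrame({
    'Conference': ['SIGIR', 'KDD', 'WWW', 'CHI', 'SIGMOD Conference'],
    'AcceptanceRate': [16, 21, 24, 25, 27],
})

# Pair the two metrics row by row using the conference name as the key.
merged = df_if.merge(df_ar, on='Conference')

# kendalltau then only sees numeric arrays, one pair per conference.
tau, p_value = kendalltau(merged['FiveYrIF'].to_numpy(),
                          merged['AcceptanceRate'].to_numpy())
print(tau, p_value)
```

Because the merge pairs rows by conference name, the row order of the two tables does not matter, and no numeric ID has to be added in the database.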


2015-09-09 12:45:30

Thanks! Works like a charm. :) – BKS
