2015-01-06

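expanding_corr function in pandas gives NaN

The data file is here.

I am just trying to compute the pairwise correlations between the columns of two data frames: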

In [7]: import os 

In [8]: import pandas as pd 

In [9]: import numpy as np 

In [10]: from pandas import Series, DataFrame 

In [12]: blog_dat = pd.read_table("blogdata.txt", index_col="Blog") 

In [13]: blog_dat = blog_dat.astype(float) 

In [14]: all(blog_dat.notnull()) 
Out[14]: True 

In [15]: x = DataFrame(np.random.randn(99*4).reshape((99, 4))) 

In [16]: pd.expanding_corr(blog_dat.iloc[:, :4], blog_dat.iloc[:, :4], pairwise=True)[-1, :, :] 
Out[16]: 
          china      kids     music     yahoo
china  1.000000  0.053069  0.026599  0.246957
kids   0.053069  1.000000  0.409978  0.094636
music  0.026599  0.409978  1.000000  0.055923
yahoo  0.246957  0.094636  0.055923  1.000000

In [17]: pd.expanding_corr(blog_dat.iloc[:, :4], x, pairwise=True)[-1, :, :] 
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1240: RuntimeWarning: unorderable types: str() < int(), sort order is undefined for incomparable objects 
    "incomparable objects" % e, RuntimeWarning) 
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1240: RuntimeWarning: unorderable types: int() < str(), sort order is undefined for incomparable objects 
    "incomparable objects" % e, RuntimeWarning) 
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1254: RuntimeWarning: unorderable types: str() > int(), sort order is undefined for incomparable objects 
    "incomparable objects" % e, RuntimeWarning) 
/usr/local/lib/python3.4/site-packages/pandas/core/index.py:1254: RuntimeWarning: unorderable types: int() > str(), sort order is undefined for incomparable objects 
    "incomparable objects" % e, RuntimeWarning) 
Out[17]: 
        0   1   2   3
china NaN NaN NaN NaN
kids  NaN NaN NaN NaN
music NaN NaN NaN NaN
yahoo NaN NaN NaN NaN

The NaNs do not go away, even if I give x an index and column names.

Answers

2

Give x the same index as blog_dat:

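import pandas as pd 
import numpy as np 
np.random.seed(1)  # make the random values in x reproducible

# Read the whitespace-separated data file (raw string avoids an invalid-escape warning)
blog_dat = pd.read_table("data", sep=r'\s+') 
# Build x on blog_dat's index so that rows align label by label
x = pd.DataFrame(np.random.randn(4*4).reshape((4, 4)), 
       index=blog_dat.index) 

# [-1, :, :] selects the last expanding window, i.e. the correlation over all rows
pd.expanding_corr(blog_dat.iloc[:, :4], x, pairwise=True)[-1, :, :] 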

yields

              0         1         2         3
china  0.684896  0.260795 -0.990586  0.281298
kids   0.077209 -0.871448  0.702822  0.241313
music -0.203808  0.071436  0.581267 -0.783753
yahoo -0.630744  0.373339 -0.060623  0.258728

It is not enough to give x just any index labels; they must match blog_dat's index.
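For readers on a recent pandas release: to the best of my knowledge the top-level `pd.expanding_corr` function was deprecated in pandas 0.18 and later removed, in favour of the `.expanding()` method. The following is a minimal sketch of the same idea with that method. The row labels, column names, and random data are hypothetical stand-ins, not the contents of the original data file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical row labels standing in for the blog names in the original file
labels = ["blog_a", "blog_b", "blog_c", "blog_d", "blog_e"]

left = pd.DataFrame(rng.standard_normal((5, 3)),
                    index=labels, columns=["china", "kids", "music"])
# Key point from the answer: right shares left's index, so rows align
right = pd.DataFrame(rng.standard_normal((5, 2)), index=labels)

# Pairwise expanding correlation; rows form a MultiIndex of
# (row label, column of right), columns are the columns of left
result = left.expanding().corr(right, pairwise=True)

# Correlations over the full sample: the block for the last row label
final = result.loc[labels[-1]]
print(final)
```

Note that `final` is laid out with the columns of `right` as rows and the columns of `left` as columns, which is the transpose of the table shown in the answer; `final.T` restores that orientation.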

+0

Cool, actually only the index needs to be in sync with blog_dat. But why even that is necessary is beyond me. – qed

+1

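Many operations in Pandas align on the index. The data points from two Series are not matched up by integer position when computing a correlation (as NumPy would do). Instead, the data points are aligned by index label. If the indices do not match, the data points miss each other entirely, the correlation is unknown, and so NaN is returned. – unutbu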

+0

@qed: Thanks for the correction. – unutbu
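The index alignment described in the comments can be seen with two small Series. This is only an illustrative sketch with made-up values: the two Series hold perfectly correlated numbers, yet `Series.corr` returns NaN when their index labels share nothing, while NumPy, which pairs values by position, does not.

```python
import numpy as np
import pandas as pd

# Made-up data: b is exactly 2 * a, so the true correlation is 1
a = pd.Series([1.0, 2.0, 3.0, 4.0], index=["w", "x", "y", "z"])
b_mismatched = pd.Series([2.0, 4.0, 6.0, 8.0])  # default integer index 0..3
b_aligned = pd.Series(b_mismatched.values, index=a.index)

# pandas pairs values by label; no labels in common means nothing to correlate
r_mismatched = a.corr(b_mismatched)

# With a shared index, the values pair up as intended
r_aligned = a.corr(b_aligned)

# NumPy pairs values by position and ignores labels altogether
r_numpy = np.corrcoef(a.values, b_mismatched.values)[0, 1]

print(r_mismatched, r_aligned, r_numpy)
```

This is the same situation as in the question, where blog_dat is indexed by blog name and x carries the default integer index.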