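Pandas MultiIndex DataFrame: N-D interpolation of missing values

Is it possible in pandas to interpolate missing values in a MultiIndex DataFrame? The example below does not work as expected: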

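import numpy as np
import pandas as pd

arr1 = np.arange(1., 10., 1.)
arr2 = np.arange(2., 20., 2.)
df1 = pd.DataFrame(list(zip(arr1, arr2, arr1 + arr2, arr1 * arr2)), columns=['x', 'y', 'xplusy', 'xtimesy'])

df1.set_index(['x', 'y'], inplace=True)

# Keep the existing (x, y) pairs and add three new ones, whose values start out as NaN.
new_index = pd.MultiIndex.from_tuples(list(zip(*df1.index.levels)) + [(2, 2), (3, 2), (5, 5)], names=['x', 'y'])
df2 = df1.reindex(index=new_index)
df2.sort_index(level=[0, 1], inplace=True)  # sortlevel() was removed from pandas
df2.interpolate(method='linear', inplace=True)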

The values shown in the xplusy and xtimesy columns for the added index entries are not what I expected.

-----------  ------  -------
(x, y)       xplusy  xtimesy
-----------  ------  -------
(1.0, 2.0)      3          2
(2.0, 2.0)      4.5        5
(2.0, 4.0)      6          8
(3.0, 2.0)      7.5       13
(3.0, 6.0)      9         18
(4.0, 8.0)     12         32
(5.0, 5.0)     13.5       41
(5.0, 10.0)    15         50
(6.0, 12.0)    18         72
(7.0, 14.0)    21         98
(8.0, 16.0)    24        128
(9.0, 18.0)    27        162
-----------  ------  -------


2015-04-06 denfromufa

So before you fill the missing values, this is what you have in the first few rows:

df2 

     xplusy  xtimesy
x y
1 2       3        2
2 2     NaN      NaN
  4       6        8

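It looks like you want to interpolate based on the MultiIndex. I don't believe there is any way to do that with pandas interpolate, but you can do it based on a simple index (by the way, method='linear' ignores the index and is also the default, so there is no need to specify it):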

df2.reset_index(level=1).interpolate(method='index') 

   y  xplusy  xtimesy
x
1  2       3        2
2  2       6        8
2  4       6        8

df2.reset_index(level=0).interpolate(method='index') 

   x  xplusy  xtimesy
y
2  1     3.0        2
2  2     3.0        2
4  2     6.0        8

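Obviously, in this case you could create xplusy and xtimesy in multiple steps (first x, then y, then xplusy and xtimesy), but I'm not sure that is what you are really trying to do.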

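Anyway, this is the kind of 1-D interpolation you can do quite easily with pandas interpolate. If that is not enough, you could look at interp2d for starters (it lives in scipy.interpolate rather than numpy, and newer SciPy releases have deprecated it).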
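To illustrate the earlier remark that method='linear' ignores the index while method='index' uses it, here is a minimal sketch. The series and its unevenly spaced index are made up for illustration and are not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical series: one missing value and an unevenly spaced index.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' treats the rows as equally spaced, so the gap is filled with the midpoint.
by_position = s.interpolate(method="linear")

# 'index' weights by the actual index values (1 is one tenth of the way from 0 to 10).
by_index = s.interpolate(method="index")

print(by_position.loc[1])  # 5.0
print(by_index.loc[1])     # 1.0
```

The same distinction applies after reset_index: whichever level remains as the index is the only coordinate that method='index' can see.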


2015-04-06 11:51:49 JohnE

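I'm looking for N-D interpolation, like griddata in scipy – denfromufa 2015-04-06 13:48:36

@denfromufa - You should add that to the question. I would also add numpy as a tag (instead of dataframe). You may want to remove pandas interpolate from the question, since it doesn't look like it will be of any use here. Just suggestions, of course. – JohnE 2015-04-06 13:53:51

I posted this question at https://groups.google.com/forum/#!topic/pydata/ido98vCx86Q – denfromufa 2015-04-06 14:35:16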

import numpy as np
import pandas as pd
from scipy.interpolate import griddata

def multireindex(_df, new_multi_index, method='linear', copy=True):
    # Note: the copy argument is accepted but not used.
    # Turn the existing index tuples into an (n_points, n_levels) coordinate array.
    _points = np.array(_df.index.values.tolist())
    dfn = dict()
    for aclm in _df.columns:
        # Interpolate each column at the new index coordinates.
        dfn[aclm] = griddata(_points, _df[aclm].to_numpy(),
                             np.array(new_multi_index), method=method)
    dfn = pd.DataFrame(dfn, index=pd.MultiIndex.from_tuples(
        new_multi_index, names=_df.index.names))
    # Return the interpolated rows followed by the original rows.
    return pd.concat([dfn, _df])

# Random sample data: x and y are integers between 0 and 100.
arr1 = np.random.rand(100)
arr2 = np.random.rand(100)
arr1, arr2 = [np.round(a * b) for a, b in zip([arr1, arr2], [100, 100])]
df1 = pd.DataFrame(list(zip(arr1, arr2, arr1 + arr2, arr1 * arr2)),
                   columns=['x', 'y', 'plus', 'times'])
df1.set_index(['x', 'y'], inplace=True)
new_points = [(20.0, 20.0), (25.0, 25.0)]
df2 = multireindex(df1, new_points)
df2.head()
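As a sanity check of the griddata approach, the sketch below uses a small regular grid instead of random points, so the expected result is known in advance: linear interpolation reproduces a linear function such as x + y exactly. The grid, the query points, and the variable names are assumptions made up for this illustration.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import griddata

# Hypothetical sample: a full 4x4 grid of (x, y) points, not the question's data.
xs, ys = np.meshgrid(np.arange(1.0, 5.0), np.arange(1.0, 5.0))
df = pd.DataFrame({"x": xs.ravel(), "y": ys.ravel()})
df["plus"] = df["x"] + df["y"]
df = df.set_index(["x", "y"])

# Index tuples become an (n_points, 2) array of coordinates.
points = np.array(df.index.tolist())
new_points = np.array([(1.5, 2.5), (3.25, 1.75)])

# Piecewise-linear interpolation of the "plus" column at the new coordinates.
plus_new = griddata(points, df["plus"].to_numpy(), new_points, method="linear")
print(plus_new)
```

Query points outside the convex hull of the known points come back as NaN with method='linear', so new index entries should lie within the range of the existing data.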


2015-04-20 06:19:51 denfromufa

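There are different approaches depending on how many rows you have.

I used my Mac Pro (16 GB RAM) to process a dataset of 70 million rows. I had to group the rows by product_id, client_id and week number to compute each client's demand. Just like your example, this dataset does not contain every product in every week. So I tried the following approaches: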

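Find the missing week numbers for each product, fill them in, and reindex. Even with the dataset split into several chunks, this took too much time and memory to return a result.
Find the missing week numbers for each product, build a new dataframe from them, and concatenate it with the original dataframe. This was more efficient, but still took too much time (several hours) and memory.
Finally, I found this post on Stack Overflow. I unstacked the week number, filling the empty weeks with "-9999" (a value that does not occur in the data), and stacked it back. After that I replaced "-9999" with np.nan and got what I wanted. It took only a few minutes. I think this is the right way to do it.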

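In conclusion, if you have limited resources, "reindex" is probably usable only on small datasets (I used the first approach on a chunk of 5 million rows and it returned within minutes), whereas "stack/unstack" can handle much larger dataframes.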
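The unstack/stack idea can be sketched roughly as follows. This is my reading of the approach, not the exact code that was run: the toy data and the demand column are invented, and only the product_id, client_id and week grouping comes from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: not every (product, client) pair appears in every week.
df = pd.DataFrame({
    "product_id": [1, 1, 2],
    "client_id": [10, 10, 20],
    "week": [1, 3, 2],
    "demand": [5.0, 7.0, 4.0],
})
s = df.set_index(["product_id", "client_id", "week"])["demand"]

# Pivot weeks into columns, marking the missing weeks with a sentinel value.
wide = s.unstack("week", fill_value=-9999)

# Stack back to long form: every pair now has a row for every week.
full = wide.stack().replace(-9999, np.nan)
print(full)
```

The sentinel has to be a value that cannot occur in the real data, otherwise genuine observations would be turned into NaN by the final replace.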


2016-08-26 07:14:43 soulcoder
