2016-10-22 82 views
2

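Fill an array with subsequent values

In my DataFrame, I will end up with a column that has only a few non-NaN values. I want to use each non-NaN value as a grouping variable for all the preceding rows that contain NaN. To simulate this, I built the following array: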

import numpy as np
from pandas import Series

count = np.array([np.nan, np.nan, np.nan, 3, np.nan, np.nan, 6, np.nan, np.nan, np.nan, np.nan, np.nan, 12])
count = Series(count)

For this array I was able to write a padding function:

def pad_expsamp_time(array):
    sect = np.zeros(array.size)  # create array filled with zeros
    inds = array.index[array.notnull()]  # index labels of the non-NaN values
    rev_inds = inds[::-1]  # iterate from high to low
    # Fill up to and including each index with that index; lower indices overwrite.
    for i in rev_inds:
        sect[:i + 1] = i
    return Series(sect)

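This function works as long as it can assume that the index of each non-NaN value equals the value itself. But how do I fill the array when the index does not equal the content?

For example, if the array count is: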

count = np.array([np.nan, np.nan, np.nan, 1, np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan, np.nan, 3])

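and the desired output is:

count = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3])

It is possible that there are NaNs at the end of the array. I want these to stay NaN, so that the DataFrame ignores them.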

count = np.array([np.nan, np.nan, np.nan, 1, np.nan, np.nan, 2, np.nan, np.nan, 3, np.nan, np.nan])
# Will become:
count = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, np.nan, np.nan])

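Could the last element be `NaN`? – Divakar

@Divakar Yes, it certainly can –

So, would you need it to be filled with something? If so, what should we fill it with? Maybe add a case for that? – Divakar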

Answers

2

IIUC, you can simply use the pandas bfill() method:

Your sample:

In [89]: s = pd.Series(np.array([np.nan,np.nan,np.nan,1,np.nan,np.nan,2,np.nan,np.nan,3,np.nan,np.nan])) 

In [90]: s 
Out[90]: 
0  NaN 
1  NaN 
2  NaN 
3  1.0 
4  NaN 
5  NaN 
6  2.0 
7  NaN 
8  NaN 
9  3.0 
10 NaN 
11 NaN 
dtype: float64 

In [91]: s.bfill() 
Out[91]: 
0  1.0 
1  1.0 
2  1.0 
3  1.0 
4  2.0 
5  2.0 
6  2.0 
7  3.0 
8  3.0 
9  3.0 
10 NaN 
11 NaN 
dtype: float64 

Divakar's samples:

In [81]: s = pd.Series(np.array([np.nan, np.nan, np.nan, 6., np.nan, np.nan, 5., np.nan, np.nan, np.nan, np.nan, np.nan, 2.]))

In [82]: s 
Out[82]: 
0  NaN 
1  NaN 
2  NaN 
3  6.0 
4  NaN 
5  NaN 
6  5.0 
7  NaN 
8  NaN 
9  NaN 
10 NaN 
11 NaN 
12 2.0 
dtype: float64 

In [83]: s.bfill() 
Out[83]: 
0  6.0 
1  6.0 
2  6.0 
3  6.0 
4  5.0 
5  5.0 
6  5.0 
7  2.0 
8  2.0 
9  2.0 
10 2.0 
11 2.0 
12 2.0 
dtype: float64 

In [84]: s = pd.Series(np.array([np.nan, np.nan, np.nan, 1., np.nan, np.nan, 2., np.nan, np.nan, np.nan, np.nan, np.nan, 3.]))

In [85]: s.bfill() 
Out[85]: 
0  1.0 
1  1.0 
2  1.0 
3  1.0 
4  2.0 
5  2.0 
6  2.0 
7  3.0 
8  3.0 
9  3.0 
10 3.0 
11 3.0 
12 3.0 
dtype: float64 

In [86]: s = pd.Series(np.array([np.nan, np.nan, np.nan, 1., np.nan, np.nan, 2., np.nan, np.nan, 3., np.nan, np.nan]))

In [87]: s.bfill() 
Out[87]: 
0  1.0 
1  1.0 
2  1.0 
3  1.0 
4  2.0 
5  2.0 
6  2.0 
7  3.0 
8  3.0 
9  3.0 
10 NaN 
11 NaN 
dtype: float64 
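
To tie this back to the grouping use case in the question, here is a minimal sketch. It assumes a DataFrame with a made-up `value` column next to the sparse `count` column; the column names `value` and `group` are illustrative and not from the original post. Rows after the last non-NaN value stay NaN after `bfill()`, and `groupby` leaves out NaN keys by default, so those rows should simply drop out of the aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'value' is invented for illustration;
# 'count' mirrors the question's sample with trailing NaNs.
df = pd.DataFrame({
    "value": np.arange(12, dtype=float),
    "count": [np.nan, np.nan, np.nan, 1, np.nan, np.nan, 2,
              np.nan, np.nan, 3, np.nan, np.nan],
})

# Back-fill so that each row takes the next non-NaN value below it.
df["group"] = df["count"].bfill()

# groupby excludes NaN keys by default, so the two trailing rows are ignored.
sums = df.groupby("group")["value"].sum()
print(sums)
```

If the trailing rows should be kept as their own group instead, passing `dropna=False` to `groupby` is one option in recent pandas versions.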

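`pandas` builtins doing magic! – Divakar

@Divakar, IMO this is a fairly standard pandas operation, so there is no magic here ;) – MaxU

Wow, really? All that trouble for nothing, haha. If only I had known this sooner. Now I can also stop writing a forward-fill function... –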

2

Here's a vectorized approach -

# Append False at either sides of NaN mask as we try to find start & 
# stop of each NaN interval by looking for rising and falling edges 
mask = np.hstack((False,np.isnan(count),False)) 
start = np.flatnonzero(mask[1:] > mask[:-1]) 
stop = np.flatnonzero(mask[1:] < mask[:-1]) 
lens = stop - start 

# Account for NaNs if any at the end of input that might throw off stop values 
stop = stop.clip(max=count.size-1) 

# Assign values 
count[mask[1:-1]] = count[stop].repeat(lens) 

Sample runs -

Case #1:

In [103]: count 
Out[103]: 
array([ nan, nan, nan, 6., nan, nan, 5., nan, nan, nan, nan, 
     nan, 2.]) 

In [104]: # Listed code ... 

In [105]: count 
Out[105]: array([ 6., 6., 6., 6., 5., 5., 5., 2., 2., 2., 2., 2., 2.]) 

Case #2:

In [118]: count 
Out[118]: 
array([ nan, nan, nan, 1., nan, nan, 2., nan, nan, nan, nan, 
     nan, 3.]) 

In [119]: # Listed code ... 

In [120]: count 
Out[120]: array([ 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., 3., 3., 3.]) 

Case #3:

In [114]: count 
Out[114]: 
array([ nan, nan, nan, 1., nan, nan, 2., nan, nan, 3., nan, 
     nan]) 

In [115]: # Listed code ... 

In [116]: count 
Out[116]: 
array([ 1., 1., 1., 1., 2., 2., 2., 3., 3., 3., nan, 
     nan]) 
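
For reuse, the listed steps can be wrapped in a function. This is only a sketch under a few assumptions: the name `backfill_nan` is made up here, the input is taken to be a 1-D float array (or something convertible to one), and the function works on a copy, whereas the listed code modifies `count` in place.

```python
import numpy as np

def backfill_nan(arr):
    """Return a copy of a 1-D float array where each run of NaNs
    is replaced by the next non-NaN value; trailing NaNs stay NaN.
    (Hypothetical helper name, not part of NumPy.)"""
    out = np.array(arr, dtype=float)  # np.array copies, so the input is untouched

    # Pad the NaN mask with False on both sides, then locate the
    # rising (start) and falling (stop) edges of each NaN run.
    mask = np.hstack((False, np.isnan(out), False))
    start = np.flatnonzero(mask[1:] > mask[:-1])
    stop = np.flatnonzero(mask[1:] < mask[:-1])
    lens = stop - start

    # A trailing NaN run has stop == out.size; clip it to the last
    # valid index, which is itself NaN, so that run remains NaN.
    stop = stop.clip(max=out.size - 1)

    # Repeat the value that follows each run across the run's length.
    out[mask[1:-1]] = out[stop].repeat(lens)
    return out

a = np.array([np.nan, np.nan, np.nan, 6, np.nan, np.nan, 5,
              np.nan, np.nan, np.nan, np.nan, np.nan, 2])
b = np.array([np.nan, np.nan, np.nan, 1, np.nan, np.nan, 2,
              np.nan, np.nan, 3, np.nan, np.nan])
filled_a = backfill_nan(a)
filled_b = backfill_nan(b)
```

Note that this expects a NumPy array rather than a pandas Series; for a Series, `s.to_numpy()` can be passed in.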

Amazing! Thank you very much! –