2016-11-21 24 views

How to refactor simple DataFrame parsing code with pandas

I am using pandas for my analysis, and I have created a DataFrame:

# Initial DF  
A B C 
0 -1 qqq XXX 
1 20 www CCC 
2 30 eee VVV 
3 -1 rrr BBB 
4 50 ttt NNN 
5 60 yyy MMM 
6 70 uuu LLL 
7 -1 iii KKK 
8 -1 ooo JJJ 

My goal is to parse column A and apply the following conditions to the DataFrame:

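  1. Inspect every row
  2. Determine whether df['A'].iloc[index] == -1
  3. If true and index = 0, mark the first row as to be removed
  4. If true and index = N (the last row), mark the last row as to be removed
  5. If 0 < index < N and df['A'].iloc[index] == -1: when the previous or the following row also contains -1 (df['A'].iloc[index+1] == -1 or df['A'].iloc[index-1] == -1), mark the row as to be removed; otherwise replace -1 with the average of the previous and following values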

The final DataFrame should look like this:

# Final DF  
A B C 
0 20 www CCC 
1 30 eee VVV 
2 40 rrr BBB 
3 50 ttt NNN 
4 60 yyy MMM 
5 70 uuu LLL 

I was able to achieve my goal by writing simple code that applies the conditions mentioned above:

import pandas as pd

# create dataframe 
data = {'A':[-1,20,30,-1,50,60,70,-1,-1], 
     'B':['qqq','www','eee','rrr','ttt','yyy','uuu','iii','ooo'], 
     'C':['XXX','CCC','VVV','BBB','NNN','MMM','LLL','KKK','JJJ']} 
df = pd.DataFrame(data) 

# For every row where df['A'].iloc[index] == -1: 
# - option 1: remove the row if it is the first or the last row 
# - option 2: remove the row if the previous or following row contains -1 (df['A'].iloc[index-1]==-1 or df['A'].iloc[index+1]==-1) 
# - option 3: otherwise replace df['A'].iloc[index] with the average of the previous and following values 
N = len(df.index) # number of rows 
index_vect = [] # store indexes of rows to be deleted 
for index in range(0, N): 

    # option 1 (first row) 
    if index == 0 and df['A'].iloc[index] == -1: 
        index_vect.append(index) 
    # interior rows only: positions 1 .. N-2, so index-1 and index+1 always exist 
    elif index > 0 and index < N - 1 and df['A'].iloc[index] == -1: 

        # option 2 
        if df['A'].iloc[index-1] == -1 or df['A'].iloc[index+1] == -1: 
            index_vect.append(index) 

        # option 3 (this chained assignment is the line that raises the warning) 
        else: 
            df['A'].iloc[index] = int((df['A'].iloc[index+1] + df['A'].iloc[index-1]) / 2) 

    # option 1 (last row; the last valid position is N-1, not N) 
    elif index == N - 1 and df['A'].iloc[index] == -1: 
        index_vect.append(index) 

# remove rows to be deleted 
df = df.drop(index_vect).reset_index(drop = True) 

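As you can see, the code is rather long, and I would like to know whether you can suggest a smarter and more efficient way to obtain the same result. In addition, I noticed that my code returns a warning message caused by the line df['A'].iloc[index] = int((df['A'].iloc[index+1]+df['A'].iloc[index-1])/2). Do you know how I could improve such a line of code?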
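On the warning: the question does not quote the message, but an assignment of the form df['A'].iloc[index] = ... is chained indexing, and that pattern commonly triggers pandas' SettingWithCopyWarning. A minimal sketch of one way to avoid it, assuming a small stand-in DataFrame rather than the full data above, is to select the row and the column in a single .iloc call:

```python
import pandas as pd

# Small stand-in frame (an assumption for illustration, not the data above)
df = pd.DataFrame({'A': [20, -1, 40], 'B': ['www', 'eee', 'rrr']})

index = 1                      # position of the row holding -1
col = df.columns.get_loc('A')  # integer position of column A

# One indexing call with (row, column): pandas writes into df itself
# instead of into an intermediate object returned by df['A'].
df.iloc[index, col] = int((df.iloc[index + 1, col] + df.iloc[index - 1, col]) / 2)

print(df)
```

With the default integer index, df.loc[index, 'A'] would do the same job; .iloc with get_loc is used here only because the original loop works with positions.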

Answer


Here is a solution:

import numpy as np 

# Let's replace -1 by Not a Number (NaN) 
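# (.ix was removed in pandas 1.0, so use .loc; cast to float first so the column can hold NaN) 
df['A'] = df['A'].astype(float) 
df.loc[df.A == -1, 'A'] = np.nan 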

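# If df.A is NaN and either the previous or next is also NaN, we don't select it 
# This takes care of the condition on the first and last row too, 
# because shift() fills the vacated edge position with NaN 
# .copy() makes the filtered frame independent, so the assignment below does not warn 
df = df[~(df.A.isnull() & (df.A.shift(1).isnull() | df.A.shift(-1).isnull()))].copy() 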

# Use interpolate to fill with the average of previous and next 
df.A = df.A.interpolate(method='linear', limit=1) 

Here is the resulting df:

A  B  C 
1 20.0 www  CCC 
2 30.0 eee  VVV 
3 40.0 rrr  BBB 
4 50.0 ttt  NNN 
5 60.0 yyy  MMM 
6 70.0 uuu  LLL 

You can reset the index afterwards if you wish.
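To reproduce the "Final DF" from the question exactly (integer values in A and an index restarting at 0), the steps above can be wrapped up roughly as follows. This is only a sketch: the helper name drop_or_interpolate and its parameters col and sentinel are made up for illustration, and it assumes every -1 that survives the filter has a valid value on both sides, which the filter guarantees.

```python
import pandas as pd


def drop_or_interpolate(df, col='A', sentinel=-1):
    """Hypothetical helper: drop sentinel rows at the edges or next to
    another sentinel; replace isolated sentinels by the neighbours' mean."""
    # Work on a float copy of the column with the sentinel turned into NaN
    values = df[col].astype(float).mask(df[col] == sentinel)
    missing = values.isna()
    # shift() leaves NaN at the vacated edge, so a sentinel in the first
    # or last row counts as having a missing neighbour and is dropped too
    neighbour_missing = values.shift(1).isna() | values.shift(-1).isna()
    keep = ~(missing & neighbour_missing)

    out = df.loc[keep].copy()
    # Remaining NaNs are isolated, so linear interpolation gives the mean
    # of the previous and next value; astype(int) truncates like int()
    out[col] = values[keep].interpolate(method='linear').astype(int)
    return out.reset_index(drop=True)


data = {'A': [-1, 20, 30, -1, 50, 60, 70, -1, -1],
        'B': ['qqq', 'www', 'eee', 'rrr', 'ttt', 'yyy', 'uuu', 'iii', 'ooo'],
        'C': ['XXX', 'CCC', 'VVV', 'BBB', 'NNN', 'MMM', 'LLL', 'KKK', 'JJJ']}
result = drop_or_interpolate(pd.DataFrame(data))
print(result)
```

Building the keep mask before interpolating mirrors the order used in the answer: rows are dropped first, so the interpolation only ever sees gaps of length one.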