2016-02-13 54 views
3

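Modify a pandas DataFrame in Python based on multiple rows

I am working with a DataFrame in pandas/Python. Each row has an ID (which is not unique), and I want to modify the DataFrame to add a column containing the second name for each row that has another row with a matching ID.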

Starting with: 

   ID Name  Rate
0   1    A  65.5
1   2    B  67.3
2   2    C  78.8
3   3    D  65.0
4   4    E  45.3
5   5    F  52.0
6   5    G  66.0
7   6    H  34.0
8   7    I   2.0

Trying to get to: 

   ID Name  Rate Secondname
0   1    A  65.5       None
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0       None
4   4    E  45.3       None
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0       None
8   7    I   2.0       None

My code:

import numpy as np 
import pandas as pd 


mydict = {'ID':[1,2,2,3,4,5,5,6,7], 
      'Name':['A','B','C','D','E','F','G','H','I'], 
      'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]} 

df=pd.DataFrame(mydict) 

df['Newname']='None' 

for i in range(0, df.shape[0]-1): 
    if df.irow(i)['ID']==df.irow(i+1)['ID']:  
        df.irow(i)['Newname']=df.irow(i+1)['Name'] 

This produces the following error:

A value is trying to be set on a copy of a slice from a DataFrame 

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy 
df.irow(i)['Newname']=df.irow(i+1)['Secondname'] 
C:\Users\L\Anaconda3\lib\site-packages\pandas\core\series.py:664:  SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame 

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy 
self.loc[key] = value 

Any help would be greatly appreciated.

Answers

4

You can use groupby with a custom function f, which uses shift and combine_first:

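def f(x): 
    # work on a copy so the group handed over by groupby is not modified in place
    x = x.copy() 
    # previous name within the group, falling back to the next name
    x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1)) 
    return x 

# group_keys=False keeps the original row index in the result
print(df.groupby('ID', group_keys=False).apply(f)) 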
    ID Name Rate Secondname 
0 1 A 65.5  NaN 
1 2 B 67.3   C 
2 2 C 78.8   B 
3 3 D 65.0  NaN 
4 4 E 45.3  NaN 
5 5 F 52.0   G 
6 5 G 66.0   F 
7 6 H 34.0  NaN 
8 7 I 2.0  NaN 
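
To make the role of shift and combine_first inside f more concrete, here is a minimal sketch on a single group. The Series `names` and the variable names are illustrative assumptions, not part of the answer; the values correspond to the two rows that share ID 2.

```python
import pandas as pd

# Names within one ID group (the two rows with ID 2); illustrative data only.
names = pd.Series(['B', 'C'], index=[1, 2])

prev_name = names.shift(1)    # name of the previous row in the group
next_name = names.shift(-1)   # name of the next row in the group

# combine_first keeps prev_name where it is not missing
# and falls back to next_name otherwise.
partner = prev_name.combine_first(next_name)
print(partner.tolist())  # ['C', 'B']
```

For a group with a single row both shifted Series contain only missing values, so the result stays NaN, which matches the output above.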

You can avoid groupby: find the duplicated rows, use loc to fill helper columns from the Name column, then apply shift and combine_first, and finally drop the helper columns:

print(df.duplicated('ID', keep='first')) 
0 False 
1 False 
2  True 
3 False 
4 False 
5 False 
6  True 
7 False 
8 False 
dtype: bool 
print(df.duplicated('ID', keep='last')) 
0 False 
1  True 
2 False 
3 False 
4 False 
5  True 
6 False 
7 False 
8 False 
dtype: bool 
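# helper column 'first': name of the second row of each pair
df.loc[df.duplicated('ID', keep='first'), 'first'] = df['Name'] 
# helper column 'last': name of the first row of each pair
df.loc[df.duplicated('ID', keep='last'), 'last'] = df['Name'] 
print(df) 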
    ID Name Rate first last 
0 1 A 65.5 NaN NaN 
1 2 B 67.3 NaN B 
2 2 C 78.8 C NaN 
3 3 D 65.0 NaN NaN 
4 4 E 45.3 NaN NaN 
5 5 F 52.0 NaN F 
6 5 G 66.0 G NaN 
7 6 H 34.0 NaN NaN 
8 7 I 2.0 NaN NaN 
df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1)) 
# drop both helper columns
df = df.drop(['first', 'last'], axis=1) 
print(df) 
    ID Name Rate SecondName 
0 1 A 65.5  NaN 
1 2 B 67.3   C 
2 2 C 78.8   B 
3 3 D 65.0  NaN 
4 4 E 45.3  NaN 
5 5 F 52.0   G 
6 5 G 66.0   F 
7 6 H 34.0  NaN 
8 7 I 2.0  NaN 
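
The keep parameter of duplicated decides which member of a duplicated group is left unflagged. A minimal sketch of the three settings, using a small illustrative frame `ids` that is not part of the answer:

```python
import pandas as pd

# Illustrative frame: only ID 2 occurs twice.
ids = pd.DataFrame({'ID': [1, 2, 2, 3]})

flag_first = ids.duplicated('ID', keep='first')  # flags every occurrence except the first
flag_last = ids.duplicated('ID', keep='last')    # flags every occurrence except the last
flag_all = ids.duplicated('ID', keep=False)      # flags every member of a duplicated group

print(flag_first.tolist())  # [False, False, True, False]
print(flag_last.tolist())   # [False, True, False, False]
print(flag_all.tolist())    # [False, True, True, False]
```

Note that duplicated looks at the whole column, not only at neighbouring rows, so an ID that reappears far away in the frame is flagged as well.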

TESTING (the solution by Roman Kh produces incorrect output when tested):

len(df) = 9

In [154]: %timeit jez(df1) 
100 loops, best of 3: 15 ms per loop 

In [155]: %timeit jez2(df2) 
100 loops, best of 3: 3.45 ms per loop 

In [156]: %timeit rom(df) 
100 loops, best of 3: 3.55 ms per loop  

len(df) = 90k

In [158]: %timeit jez(df1) 
10 loops, best of 3: 57.1 ms per loop 

In [159]: %timeit jez2(df2) 
10 loops, best of 3: 36.4 ms per loop 

In [160]: %timeit rom(df) 
10 loops, best of 3: 40.4 ms per loop 

Code used for the tests above:

import pandas as pd 

mydict = {'ID':[1,2,2,3,4,5,5,6,7], 
      'Name':['A','B','C','D','E','F','G','H','I'], 
      'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]} 

df=pd.DataFrame(mydict) 
print(df) 


df = pd.concat([df]*10000).reset_index(drop=True) 

df1 = df.copy() 
df2 = df.copy() 

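def jez(df): 
    def f(x): 
        # work on a copy so the group handed over by groupby is not modified in place
        x = x.copy() 
        x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1)) 
        return x 

    # group_keys=False keeps the original row index in the result
    return df.groupby('ID', group_keys=False).apply(f) 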


def jez2(df): 
    #print df.duplicated('ID', keep='first') 
    #print df.duplicated('ID', keep='last') 
    df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name'] 
    df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name'] 
    #print df 

    df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1)) 
    df = df.drop(['first', 'last'], axis=1) 
    return df 



def rom(df): 

    # cpIDs = True if the next row has the same ID 
    # (shift(-1) yields NaN for the last row, which compares as False) 
    df['cpIDs'] = df['ID'] == df['ID'].shift(-1) 
    # ShiftName == Name of the next row 
    df['ShiftName'] = df['Name'].shift(-1) 
    # fill in SecondName 
    df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName'] 
    # remove columns 
    del df['cpIDs'] 
    del df['ShiftName'] 
    return df 


print(jez(df1)) 
print(jez2(df2)) 
print(rom(df)) 
print(jez(df1)) 
    ID Name Rate Secondname 
0 1 A 65.5  NaN 
1 2 B 67.3   C 
2 2 C 78.8   B 
3 3 D 65.0  NaN 
4 4 E 45.3  NaN 
5 5 F 52.0   G 
6 5 G 66.0   F 
7 6 H 34.0  NaN 
8 7 I 2.0  NaN 
print(jez2(df2)) 
    ID Name Rate SecondName 
0 1 A 65.5  NaN 
1 2 B 67.3   C 
2 2 C 78.8   B 
3 3 D 65.0  NaN 
4 4 E 45.3  NaN 
5 5 F 52.0   G 
6 5 G 66.0   F 
7 6 H 34.0  NaN 
8 7 I 2.0  NaN 
print(rom(df)) 
    ID Name Rate SecondName 
0 1 A 65.5  NaN 
1 2 B 67.3   C 
2 2 C 78.8  NaN 
3 3 D 65.0  NaN 
4 4 E 45.3  NaN 
5 5 F 52.0   G 
6 5 G 66.0  NaN 
7 6 H 34.0  NaN 
8 7 I 2.0  NaN 

EDIT:

If the same duplicated pairs occur more than once, use shift to create the first and last columns:

df.loc[ df['ID'] == df['ID'].shift(), 'first'] = df['Name'] 
df.loc[ df['ID'] == df['ID'].shift(-1), 'last'] = df['Name'] 
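
The shift-based comparison can also be written end to end without helper columns. The following is a sketch under two assumptions that the answer does not state explicitly: rows with the same ID are adjacent, and each ID occurs at most twice. The names `same_as_prev` and `same_as_next` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3, 4, 5, 5, 6, 7],
                   'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
                   'Rate': [65.5, 67.3, 78.8, 65, 45.3, 52, 66, 34, 2]})

# True where the row has the same ID as its neighbour (assumes adjacent pairs).
same_as_prev = df['ID'] == df['ID'].shift()
same_as_next = df['ID'] == df['ID'].shift(-1)

# Take the previous row's name where the ID matches the previous row,
# otherwise the next row's name where the ID matches the next row.
df['SecondName'] = (df['Name'].shift().where(same_as_prev)
                    .combine_first(df['Name'].shift(-1).where(same_as_next)))
print(df)
```

Because only neighbouring rows are compared, an ID that reappears elsewhere in the frame does not affect the result, unlike with duplicated.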

Great answer! But what if there are several duplicated rows? Still, that is a question for the OP. –

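Really helpful, thank you. I implemented your groupby approach and it works well. Does the duplicated approach have an advantage, or is it just an alternative? Thanks. – LJH11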

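The answer is simple: the approach without groupby is faster. On the small DataFrame it is roughly 5 times faster, and on the large DataFrame it takes about 0.6 times as long, as you can see in my tests comparing %timeit jez(df1) with %timeit jez2(df2). – jezrael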

0

If your DataFrame is sorted by ID, you can add a new column that compares the ID of the current row with the ID of the next row:

# cpIDs = True if the next row has the same ID 
# (shift(-1) yields NaN for the last row, which compares as False) 
df['cpIDs'] = df['ID'] == df['ID'].shift(-1) 
# ShiftName == Name of the next row 
df['ShiftName'] = df['Name'].shift(-1) 
# fill in SecondName 
df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName'] 
# remove columns 
del df['cpIDs'] 
del df['ShiftName'] 

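Of course, you can shorten the code above; I deliberately made it longer but easier to understand. Depending on the size of your DataFrame it can be very fast (possibly the fastest), because it does not use any complex operations.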

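P.S. As a side note, try to avoid loops when working with DataFrames and NumPy arrays. You can almost always find a so-called vectorized solution, which operates on whole arrays or large ranges rather than on individual cells and rows.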
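
As a rough illustration of that advice, the sketch below computes the same "next row has the same ID" flag once with an explicit loop and once with a single array comparison. The variable names are assumptions made for this example, and no timing claim is implied.

```python
import numpy as np
import pandas as pd

ids = pd.Series([1, 2, 2, 3, 4, 5, 5, 6, 7])

# Loop version: visits one row at a time.
loop_flags = []
for i in range(len(ids)):
    has_next = i + 1 < len(ids)
    loop_flags.append(bool(has_next and ids.iloc[i] == ids.iloc[i + 1]))

# Vectorized version: one comparison over the whole column,
# with False appended for the last row, which has no successor.
values = ids.to_numpy()
vector_flags = np.append(values[:-1] == values[1:], False)

print(loop_flags == vector_flags.tolist())  # True
```

Both versions yield the same flags; the vectorized one hands the per-row work to NumPy instead of the Python interpreter.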

Please check your answer, because it produces incorrect output. – jezrael

Yes, you are right, I slightly misunderstood the task. Still, I will leave it unchanged, because the second solution is great. –