2016-03-03 22 views

Look up a column's values to replace NaN

I have a sample pandas DataFrame as follows:

import numpy as np
import pandas as pd

df = pd.DataFrame({
'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']), 
'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]), 
'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan]) 
}) 


name            notes                                                occupation
NaN             meth cook makes meth with purity of over 96%         meth cook
Walter White    meth cook is also called Heisenberg                  NaN
NaN             meth cook has cancer                                 NaN
NaN             he is known as the best meth cook                    NaN
NaN             Meth Dealer added chili powder to his batch          NaN
NaN             Meth Dealer learned to make the best meth            NaN
Jessie Pinkman  everyone goes to this Meth Dealer for best shot      meth dealer
NaN             girlfriend of the meth dealer died                   NaN
Saul Goodman    this lawyer is a people pleasing person              NaN
NaN             cinnabon has now hired the lawyer as a baker         lawyer
NaN             lawyer had to take off in the end                    NaN
NaN             lawyer has a lot of connections who knows other guy  NaN

So we have three occupations in total:

pd.unique(df.occupation.dropna())  # drop NaN first, otherwise nan is listed as a value

array(['meth cook', 'meth dealer', 'lawyer'], dtype=object) 

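I would like to look up the 'occupation' values in the 'notes' column and, wherever a known occupation appears in a row's notes, fill that row's missing occupation with the matching value. For example, in the second row the occupation is missing. If we search the 'notes' column for ('meth cook', 'meth dealer', 'lawyer'), we see that 'meth cook' appears in the second row's notes, so the missing occupation should be filled with 'meth cook'.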

I tried:

df.occupation[df.occupation.notnull()].apply(lambda x: df.occupation.str.extract('('+x+')')) 

However, it does not give me the result I want. I would like the result to look like this:

name            notes                                                occupation
NaN             meth cook makes meth with purity of over 96%         meth cook
Walter White    meth cook is also called Heisenberg                  meth cook
NaN             meth cook has cancer                                 meth cook
NaN             he is known as the best meth cook                    meth cook
NaN             Meth Dealer added chili powder to his batch          meth dealer
NaN             Meth Dealer learned to make the best meth            meth dealer
Jessie Pinkman  everyone goes to this Meth Dealer for best shot      meth dealer
NaN             girlfriend of the meth dealer died                   meth dealer
Saul Goodman    this lawyer is a people pleasing person              lawyer
NaN             cinnabon has now hired the lawyer as a baker         lawyer
NaN             lawyer had to take off in the end                    lawyer
NaN             lawyer has a lot of connections who knows other guy  lawyer

Can anyone offer any input?

Answers


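You can do this with a loop: for each known occupation, use str.contains on notes to select the rows that mention it, then fill the missing values of occupation in that subset: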

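# Distinct, non-missing occupations to look for
occ = df.occupation.dropna().unique()

for pa in occ:
    # Rows whose notes mention this occupation (case-insensitive)
    subset = df.notes.str.contains(pa, case=False)
    # Fill only the missing occupations in those rows; .loc avoids chained assignment
    df.loc[subset, 'occupation'] = df.loc[subset, 'occupation'].fillna(pa)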


In [40]: df 
Out[40]: 
    name            notes                                                occupation
 0  NaN             meth cook makes meth with purity of over 96%         meth cook
 1  Walter White    meth cook is also called Heisenberg                  meth cook
 2  NaN             meth cook has cancer                                 meth cook
 3  NaN             he is known as the best meth cook                    meth cook
 4  NaN             Meth Dealer added chili powder to his batch          meth dealer
 5  NaN             Meth Dealer learned to make the best meth            meth dealer
 6  Jessie Pinkman  everyone goes to this Meth Dealer for best shot      meth dealer
 7  NaN             girlfriend of the meth dealer died                   meth dealer
 8  Saul Goodman    this lawyer is a people pleasing person              lawyer
 9  NaN             cinnabon has now hired the lawyer as a baker         lawyer
10  NaN             lawyer had to take off in the end                    lawyer
11  NaN             lawyer has a lot of connections who knows othe...    lawyer
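The same idea can probably be written without an explicit loop. The sketch below is not part of the answer above: it joins the known occupations into one alternation pattern, pulls the first match out of notes with str.extract, and uses that only where occupation is missing. The small frame and the names pattern, canon and found are made up for illustration, and it assumes each note mentions at most one occupation.

```python
import re

import numpy as np
import pandas as pd

# Small stand-in for the question's frame (illustrative data only)
df = pd.DataFrame({
    'notes': ['meth cook has cancer',
              'Meth Dealer added chili powder to his batch',
              'everyone goes to this Meth Dealer for best shot',
              'lawyer had to take off in the end',
              'cinnabon has now hired the lawyer as a baker',
              'meth cook makes meth with purity of over 96%'],
    'occupation': [np.nan, np.nan, 'meth dealer', np.nan, 'lawyer', 'meth cook'],
})

# Known occupations, escaped so regex metacharacters are matched literally
occ = df['occupation'].dropna().unique()
pattern = '(' + '|'.join(re.escape(o) for o in occ) + ')'

# Map lower-cased text back to the spelling used in the occupation column
canon = {o.lower(): o for o in occ}

# First occupation mentioned in each note, matched case-insensitively
found = (df['notes']
         .str.extract(pattern, flags=re.IGNORECASE, expand=False)
         .str.lower()
         .map(canon))

# Keep existing values; fill only the gaps
df['occupation'] = df['occupation'].fillna(found)
```

Rows whose notes mention none of the known occupations simply stay NaN, which may be easier to handle on a real dataset than an exception.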

Hey, thanks! While I was waiting I did the following: `occup_list = list(pd.unique(df.occupation)); occupation_list = [x for x in occupation_list if str(x) != 'nan']; df['occupation'].fillna(df.loc[pd.isnull(df.occupation)]['notes'].apply(lambda x: filter(lambda occ: re.search(occ.lower(), x.lower()), occupation_list)[0]), inplace=True)`. This seems to work. However, when I apply it to my actual dataset, I get the following error: list index. I'm not sure how filter() works.
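On the filter() question: filter(func, iterable) keeps the items for which func returns a truthy value. In Python 2 it returns a list, in Python 3 a lazy iterator that cannot be indexed. One plausible cause of the error (a guess, since the comment only says "list index") is a note that mentions none of the occupations: the filtered result is then empty and `[0]` fails. A minimal sketch that sidesteps this with next() and a default follows; first_match and the 'owns a car wash' note are hypothetical, not from the thread.

```python
import re

occupation_list = ['meth cook', 'meth dealer', 'lawyer']


def first_match(note, occupations):
    """Return the first occupation found in note, or None if there is none."""
    matches = (occ for occ in occupations
               if re.search(re.escape(occ), note, flags=re.IGNORECASE))
    # next() with a default avoids indexing into an empty result
    return next(matches, None)


hit = first_match('he is known as the best meth cook', occupation_list)
miss = first_match('owns a car wash', occupation_list)  # made-up note with no match
```

Passing a function like this to df['notes'].apply would leave unmatched rows as None instead of raising.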