2017-04-17 73 views
1

Pandas - check whether dataframe columns contain key:value pairs from a dictionary

This question is related to another question I posted: Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe

My goal is to check whether two different columns of a dataframe contain a pair of string values and, if they do, extract one of those values.

I have two dataframes like this:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'consumption': ['squirrelate apple', 'monkey likesapple',
            'monkey banana gets', 'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves', 'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
        'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft', 'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']})

df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'], 'creature': ['squirrel', 'badger', 'monkey', 'elephant']})

In [187]:df1 
Out[187]: 
      consumption    name 
0  squirrelate apple    apple 
1  monkey likesapple   appleisred 
2 monkey banana gets banana is tropical 
3  badger/getsbanana  banana is soft 
4 giraffe eats grass  lemon is sour 
5 badger apple.loves washington apples 
6  elephant is huge    kiwi 
7  elephant/eats/    bananas 
8 squirrel.digsingrass    apples 

In[188]: df2 
Out[188]: 
    creature food 
0 squirrel apple 
1 badger apple 
2 monkey banana 
3 elephant banana 

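What I want to do is test whether 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption']. If both conditions are met, extract 'squirrel' from df1['consumption'] into a new column df1['creature']. The result should look like this: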

Out[189]: 
      consumption creature    name 
0  squirrelate apple squirrel    apple 
1  monkey likesapple  NaN   appleisred 
2 monkey banana gets monkey banana is tropical 
3  badger/getsbanana  NaN  banana is soft 
4 giraffe eats grass  NaN  lemon is sour 
5 badger apple.loves badger washington apples 
6  elephant is huge  NaN    kiwi 
7  elephant/eats/ elephant    bananas 
8 squirrel.digsingrass  NaN    apples 

Without the pairing constraint, I could do something simple like:

np.where((df1['consumption'].str.contains(<creature_string>, case = False)) & (df1['name'].str.contains(<food_string>, case = False)), df1['consumption'].str.extract(<creature_string>), np.nan)

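But I have to check pairs, so I tried building a dictionary with the foods as keys and the creatures as values, then joining all creatures for a given food key into one pattern string and searching for it with str.contains: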

unique_food = df2.food.unique() 
food_dict = {elem : pd.DataFrame for elem in unique_food} 
for key in food_dict.keys(): 
    food_dict[key] = df2[:][df2.food == key] 

# create key:value pairs of food key and creature strings 
food_strings = {} 
for key, values in food_dict.items(): 
    food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))}) 

In[199]: food_strings 
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'} 
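
The same food-to-pattern mapping can probably be built in one step with groupby. This is a minimal sketch, assuming df2 as defined in the question; it is not part of the original post.

```python
import pandas as pd

df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})

# For each food, join its unique creatures into one regex alternation.
food_strings = (
    df2.groupby('food')['creature']
       .agg(lambda s: '|'.join(s.unique()))
       .to_dict()
)
print(food_strings)
```

This avoids the intermediate dictionary of per-food dataframes while keeping the same keys and pattern strings.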

The problem comes when I now try to apply str.contains:

for key, value in food_strings.items(): 
    np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) & 
      (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan) 

I get a KeyError:

--------------------------------------------------------------------------- 
KeyError         Traceback (most recent call last) 
<ipython-input-62-7ab718066040> in <module>() 
     1 for key, value in food_strings.items(): 
     2  np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) & 
----> 3    (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan) 

KeyError: 'squirrel|badger' 
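
The traceback is consistent with how dict.items() works: `value` is already the pattern string, so `food_strings[value]` looks that pattern up as if it were a food key. A small sketch that reproduces this using only the dictionary from the question (the `missing` list is a hypothetical helper added for illustration):

```python
food_strings = {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}

missing = []  # hypothetical helper: patterns that failed as keys
for key, value in food_strings.items():
    # `value` is the creature pattern itself, e.g. 'squirrel|badger'.
    assert food_strings[key] == value
    try:
        food_strings[value]  # treats the pattern as a food key
    except KeyError as err:
        missing.append(err.args[0])

print(missing)
```

Using `value` directly as the pattern, and `key` as the food to look for in df1['name'], avoids the second lookup altogether.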

When I try using only the values and not the keys, it works for the first key:value pair but not the second:

for key in food_strings.keys(): 
    df1['test'] = np.where(df1['consumption'].str.contains('('+food_strings[key]+')', case =False), 
           df1['consumption'].str.extract('('+food_strings[key]+')', expand=False), 
           np.nan) 

df1 
Out[196]: 
      consumption    name  test 
0  squirrelate apple    apple squirrel 
1  monkey likesapple   appleisred  NaN 
2 monkey banana gets banana is tropical  NaN 
3  badger/getsbanana  banana is soft badger 
4 giraffe eats grass  lemon is sour  NaN 
5 badger apple.loves washington apples badger 
6  elephant is huge    kiwi  NaN 
7  elephant/eats/    bananas  NaN 
8 squirrel.digsingrass    apples squirrel 

I get the matches for apple: squirrel|badger, but miss banana: monkey.
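
One likely reason for the missing rows is that each pass of the loop assigns the whole `test` column again, so only the matches from the last key processed survive. Below is a sketch of a loop that accumulates results instead, and also checks the food in df1['name']. It assumes plain substring matching, so row 8 ('apples' with 'squirrel.digsingrass') is filled, unlike the desired output shown earlier.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'consumption': ['squirrelate apple', 'monkey likesapple', 'monkey banana gets',
                    'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves',
                    'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
    'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft',
             'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']})
food_strings = {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}

# Start from an all-NaN column and fill it in one food at a time.
creature = pd.Series(np.nan, index=df1.index, dtype=object)
for food, pattern in food_strings.items():
    has_food = df1['name'].str.contains(food, case=False)
    found = df1['consumption'].str.extract('(' + pattern + ')', expand=False)
    # Keep earlier results; overwrite only rows matching this food/creature pair.
    creature = creature.where(~(has_food & found.notna()), found)

df1['creature'] = creature
print(df1)
```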

Can anyone help?

+0

I think each value in `food_dict` contains a dataframe, not a string. The error occurs when you do `for key, values in food_dict.items():`. You are passing `value`, a dataframe, to `food_strings[value]`. – titipata

+1

@titipat That was a typo, sorry, but good catch. I have edited the question and pasted the exact error I get. – vagabond

Answers

2
d1 = df1.dropna() 
d2 = df2.dropna() 

sump = d1.consumption.values.tolist() 
name = d1.name.values.tolist() 
cret = d2.creature.values.tolist() 
food = d2.food.values.tolist() 

check = np.array(
    [ 
     [c in s and f in n for c, f in zip(cret, food)] 
     for s, n in zip(sump, name) 
    ] 
) 

# create a new series with the index of `d1` where we dropped na 
# then reindex with `df1.index` prior to `assign` 
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index) 
test = test.reindex(df1.index, fill_value='') 
df1.assign(test=test) 

      consumption    name  test 
0  squirrelate apple    apple squirrel 
1  monkey likesapple   appleisred   
2 monkey banana gets banana is tropical monkey 
3  badger/getsbanana  banana is soft   
4 giraffe eats grass  lemon is sour   
5 badger apple.loves washington apples badger 
6  elephant is huge    kiwi   
7  elephant/eats/    bananas elephant 
8 squirrel.digsingrass    apples squirrel 
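
The `check.dot(...)` step appears to rely on NumPy falling back to Python's `*` and `+` for object arrays: `True * 'squirrel'` is `'squirrel'`, `False * 'badger'` is `''`, and the row sum concatenates the pieces. A small sketch with made-up values (the arrays below are illustrative, not from the answer):

```python
import numpy as np

# Column vector of creature names, stored as Python objects.
creatures = np.array([['squirrel'], ['badger'], ['monkey']], dtype=object)

# One row per record, one column per (creature, food) pair.
check = np.array([
    [True, False, False],   # matches the first pair only
    [False, False, False],  # matches nothing
    [False, True, True],    # matches two pairs
])

# bool * str keeps or blanks each name; the sum concatenates them.
result = check.dot(creatures).ravel()
print(result.tolist())
```

A row that matches more than one pair ends up with the names run together, so this approach fits best when each row can match at most one pair.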
+0

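Hi! Thanks, awesome solution. One problem: it breaks when the lists contain None values. I get this error: `TypeError: argument of type 'NoneType' is not iterable`. I made lists without NoneTypes, `sump = df1[df1.consumption.notnull()]['consumption'].values.tolist()`, for sump, name, cret and food. Then the `check` step works, but at df1.assign I get: `ValueError: Length of values does not match length of index` – vagabond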

+0

When iterating through zip(sump, name), I need to get a NaN/None value whenever either side of c - s or f - n is NoneType. – vagabond

+0

dropna doesn't work... so I'll change the dataframe instead! – vagabond
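
One way to handle the missing values described in these comments, sketched under the assumption that a row with a missing `consumption` or `name` should simply get no creature. The helper `find_creature` and the shortened df1 are hypothetical; because every row is processed, the result keeps the length of df1 and no reindexing is needed.

```python
import pandas as pd

# Shortened, illustrative data with missing values in both columns.
df1 = pd.DataFrame({'consumption': ['squirrelate apple', None, 'monkey banana gets'],
                    'name': ['apple', 'kiwi', None]})
df2 = pd.DataFrame({'food': ['apple', 'apple', 'banana', 'banana'],
                    'creature': ['squirrel', 'badger', 'monkey', 'elephant']})

pairs = list(zip(df2['creature'], df2['food']))

def find_creature(consumption, name):
    """Return the first creature whose (creature, food) pair matches, else None."""
    # None/NaN are not strings, so `in` would fail on them; skip such rows.
    if not isinstance(consumption, str) or not isinstance(name, str):
        return None
    for creature, food in pairs:
        if creature in consumption and food in name:
            return creature
    return None

df1['creature'] = [find_creature(s, n)
                   for s, n in zip(df1['consumption'], df1['name'])]
print(df1)
```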
