2016-09-22 204 views
1

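How can I use pandas to efficiently append multiple KPI values for each individual customer? (pandas: append multiple columns for a single one)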

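Joining the countryKPI df with the customers df causes some problems, because country is the index of the data frame while nationality is not in the index.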

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Germany','Austria'], 
          'value':[7,8]}) 

See the desired result, marked in pink: [image]

Answers

1

You can counter the mismatch in categories through merge:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator']) 
df.index.name = 'nationality'  
customers.merge(df['value'].reset_index(), on='nationality', how='outer') 


Data:

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Slovakia','Austria'], 
          'value':[7,8]}) 
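
As a minimal sketch of the same idea with the data above: passing values='value' to pivot_table avoids the two-level column header, so the pivoted frame can be merged directly. The names kpi_wide and result are illustrative only, and how='left' is an assumption here (it keeps every customer, including one whose nationality has no KPI rows); the answer itself uses how='outer'.

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Slovakia', 'Austria'],
                          'value': [7, 8]})

# One row per country, one column per indicator (plain, single-level columns).
kpi_wide = (countryKPI
            .pivot_table(index='country', columns='indicator', values='value')
            .reset_index()
            .rename(columns={'country': 'nationality'}))
kpi_wide.columns.name = None  # drop the leftover 'indicator' label

# Assumption: a left join, so customers without KPI data are kept with NaN.
result = customers.merge(kpi_wide, on='nationality', how='left')
print(result)
```

With how='left', the Slovakian customer stays in the result with missing x and z values, and Germany (which has no customer) does not appear.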

The problem appears to be that the pivot operation leaves a CategoricalIndex in your DF, and when you perform reset_index, it complains with that error.

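Simply reverse-engineer it: check the dtypes of the countryKPI and customers DataFrames, and wherever category is mentioned, convert those columns to their string representation via astype(str).

Reproducing the error and countering it: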

Assuming the DFs are the ones mentioned above:

countryKPI['indicator'] = countryKPI['indicator'].astype('category') 
countryKPI['country'] = countryKPI['country'].astype('category') 
customers['nationality'] = customers['nationality'].astype('category') 

countryKPI.dtypes 
country  category 
indicator category 
value   int64 
dtype: object 

customers.dtypes 
customer   object 
nationality category 
value    int64 
dtype: object 

After the pivot operation:

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator']) 
df.index 
CategoricalIndex(['Austria', 'Germany'], categories=['Austria', 'Germany'], ordered=False, 
        name='country', dtype='category') 
# ^^ See the categorical index 

When you perform reset_index:

df.reset_index() 

TypeError: cannot insert an item into a CategoricalIndex that is not already an existing category

To fix this error, simply cast the categorical columns to str type.

countryKPI['indicator'] = countryKPI['indicator'].astype('str') 
countryKPI['country'] = countryKPI['country'].astype('str') 
customers['nationality'] = customers['nationality'].astype('str') 

Now the reset_index part works, and so does the merge.
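
If the real data has many columns, the cast can be applied to every category column at once. The following is only a sketch: categories_to_str is a hypothetical helper name, not part of pandas, and it assumes that losing the categorical dtype is acceptable for the join.

```python
import pandas as pd

def categories_to_str(df):
    """Return a copy of df with every category-dtype column cast to str.

    Hypothetical helper for illustration; not a pandas function.
    """
    out = df.copy()
    for col in out.select_dtypes(include='category').columns:
        out[col] = out[col].astype(str)
    return out

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})
customers = pd.DataFrame({'customer': ['first', 'second'],
                          'nationality': ['Germany', 'Austria'],
                          'value': [7, 8]})

# Recreate the categorical columns from the reproduction above.
countryKPI['indicator'] = countryKPI['indicator'].astype('category')
countryKPI['country'] = countryKPI['country'].astype('category')
customers['nationality'] = customers['nationality'].astype('category')

# Cast them back to strings before pivoting and merging.
countryKPI = categories_to_str(countryKPI)
customers = categories_to_str(customers)

df = pd.pivot_table(data=countryKPI, index=['country'], columns=['indicator'])
df.index.name = 'nationality'
merged = customers.merge(df['value'].reset_index(), on='nationality', how='outer')
print(merged)
```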

+0

Interesting and simple. But http://imgur.com/a/PeCyh why do I get several other values for the initial data set (0,1,2,3)? –

+0

I see - your latest edit invalidates my latest comment. –

+0

However, the following problem still remains: cannot insert an item into a CategoricalIndex that is not already an existing category –

2

I think you can use concat:

df_pivoted = countryKPI.pivot_table(index='country', 
           columns='indicator', 
           values='value', 
           fill_value=0) 
print (df_pivoted)  
indicator x z 
country   
Austria 7 7 
Germany 8 9 

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1)) 
     customer value x z 
Austria second  8 7 7 
Germany first  7 8 9      


print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1) 
     .reset_index() 
     .rename(columns={'index':'nationality'}) 
     [['customer','nationality','value','x','z']]) 

    customer nationality value x z 
0 second  Austria  8 7 7 
1 first  Germany  7 8 9 

EDIT by comment:

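The problem is that the dtypes of the columns customers.nationality and countryKPI.country are category, and if some categories are missing from one of them, it raises the error: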

ValueError: incompatible categories in categorical concat

The solution is to build the combined set of categories with union and then apply it to both columns with set_categories:

import pandas as pd 
import numpy as np 

countryKPI = pd.DataFrame({'country':['Austria','Germany', 'Germany', 'Austria'], 
          'indicator':['z','x','z','x'], 
          'value':[7,8,9,7]}) 
customers = pd.DataFrame({'customer':['first','second'], 
          'nationality':['Slovakia','Austria'], 
          'value':[7,8]}) 

customers.nationality = customers.nationality.astype('category') 
countryKPI.country = countryKPI.country.astype('category') 

print (countryKPI.country.cat.categories) 
Index(['Austria', 'Germany'], dtype='object') 

print (customers.nationality.cat.categories) 
Index(['Austria', 'Slovakia'], dtype='object') 

all_categories =countryKPI.country.cat.categories.union(customers.nationality.cat.categories) 
print (all_categories) 
Index(['Austria', 'Germany', 'Slovakia'], dtype='object') 

customers.nationality = customers.nationality.cat.set_categories(all_categories) 
countryKPI.country = countryKPI.country.cat.set_categories(all_categories) 
df_pivoted = countryKPI.pivot_table(index='country', 
           columns='indicator', 
           values='value', 
           fill_value=0) 
print (df_pivoted)  
indicator x z 
country   
Austria 7 7 
Germany 8 9 
Slovakia 0 0   

print (pd.concat([customers.set_index('nationality'), df_pivoted], axis=1) 
     .reset_index() 
     .rename(columns={'index':'nationality'}) 
     [['customer','nationality','value','x','z']]) 

    customer nationality value x z 
0 second  Austria 8.0 7 7 
1  NaN  Germany NaN 8 9 
2 first Slovakia 7.0 0 0 

If you need better performance, use groupby instead of pivot_table:

# The parentheses let the method chain span several lines
df_pivoted1 = (countryKPI.groupby(['country','indicator'])
         .mean()
         .squeeze()
         .unstack()
         .fillna(0))
print (df_pivoted1) 
indicator x z 
country    
Austria 7.0 7.0 
Germany 8.0 9.0 
Slovakia 0.0 0.0 
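
The Slovakia row appears in the output above only because groupby keeps categories that have no rows. As far as I know, newer pandas versions expose this through the observed parameter of groupby, and its default has been changing between releases, so a sketch that states the intent explicitly may be safer. The name df_pivoted2 and the hard-coded category list are assumptions for illustration.

```python
import pandas as pd

countryKPI = pd.DataFrame({'country': ['Austria', 'Germany', 'Germany', 'Austria'],
                           'indicator': ['z', 'x', 'z', 'x'],
                           'value': [7, 8, 9, 7]})

# Assumption: 'Slovakia' is a declared category with no KPI rows.
all_categories = ['Austria', 'Germany', 'Slovakia']
countryKPI['country'] = pd.Categorical(countryKPI['country'],
                                       categories=all_categories)

# observed=False asks groupby to keep categories that never occur in the data,
# so Slovakia still gets a row (filled with 0 after unstacking).
df_pivoted2 = (countryKPI
               .groupby(['country', 'indicator'], observed=False)['value']
               .mean()
               .unstack()
               .fillna(0))
print(df_pivoted2)
```

Selecting the 'value' column before calling mean() returns a Series directly, which makes the squeeze() step unnecessary in this variant.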

Timings:

In [177]: %timeit countryKPI.pivot_table(index='country', columns='indicator', values='value', fill_value=0) 
100 loops, best of 3: 6.24 ms per loop 

In [178]: %timeit countryKPI.groupby(['country','indicator']).mean().squeeze().unstack().fillna(0) 
100 loops, best of 3: 4.28 ms per loop 
+0

This almost works - but I get the error: incompatible categories in categorical concat –

+1

The problem is with the real data, right? I think the sample works perfectly. – jezrael

+0

Unfortunately, yes. –