2016-05-16 97 views

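I'm new to pandas and I'm trying to map multiple columns instead of one. This page shows how to do it with pd.Series, but I can't figure out how to map multiple columns of a DataFrame rather than a Series.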

Here are the two DataFrames I'm trying to map:

import numpy as np
import pandas as pd

data2 = pd.DataFrame(np.random.randn(5, 2), index=range(0, 5), columns=['x', 'y'])
data2['Cluster'] = ['A', 'B', 'A', 'B', 'C']
centers2 = pd.DataFrame(np.random.randint(0, 10, size=(3, 2)), index=['A', 'B', 'C'], columns=['x', 'y'])

Here is what data2 looks like:

data2

          x         y Cluster
0  0.151212 -0.168855       A
1 -0.078935  1.933378       B
2 -0.388903  0.444610       A
3  0.622089  1.609730       B
4 -0.346856  1.095834       C

And centers2 looks like this:

centers2
   x  y
A  6  4
B  6  0
C  4  1

I want to create two separate columns in data2, filled with the matching values from centers2. Here is my manual attempt:

data2['Centers.x'] = [6, 6, 6, 6, 4]
data2['Centers.y'] = [4, 0, 4, 0, 1]
data2
          x         y Cluster  Centers.x  Centers.y
0  0.151212 -0.168855       A          6          4
1 -0.078935  1.933378       B          6          0
2 -0.388903  0.444610       A          6          4
3  0.622089  1.609730       B          6          0
4 -0.346856  1.095834       C          4          1

How can I do this with the map function? (I know how to do it with a loop; I need a vectorized solution.)

Answers


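.merge() is the closest pd.DataFrame equivalent of pd.Series.map(). You can use the suffixes=[] keyword to add custom suffixes to overlapping column names, e.g. suffixes=['', '_centers'].

Note that pd.Series has no .merge(), and pd.DataFrame has no lookup-style .map() (since pandas 2.1 there is a DataFrame.map(), but it applies a function elementwise, like the older applymap()).

With: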

data2
          x         y Cluster
0 -1.406449 -0.244859       A
1  1.002103  0.214346       B
2  0.353894  0.353995       A
3  1.249199 -0.661904       B
4  0.623962 -1.754789       C

centers2
   x  y
A  0  9
B  6  9
C  0  6

you get:

data2.merge(centers2, left_on='Cluster', right_index=True, suffixes=['', '_centers']).sort_index() 

          x         y Cluster  x_centers  y_centers
0 -1.406449 -0.244859       A          0          9
1  1.002103  0.214346       B          6          9
2  0.353894  0.353995       A          0          9
3  1.249199 -0.661904       B          6          9
4  0.623962 -1.754789       C          0          6
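
Because the frames above are random, here is a small reproducible sketch of the same merge using the fixed values from the question. The literal numbers are copied from the question's printout; the variable name `result` is just an illustration.

```python
import pandas as pd

# Fixed values copied from the question so the result is reproducible.
data2 = pd.DataFrame(
    {
        "x": [0.151212, -0.078935, -0.388903, 0.622089, -0.346856],
        "y": [-0.168855, 1.933378, 0.444610, 1.609730, 1.095834],
        "Cluster": ["A", "B", "A", "B", "C"],
    }
)
centers2 = pd.DataFrame({"x": [6, 6, 4], "y": [4, 0, 1]}, index=["A", "B", "C"])

# Look up each row's cluster label in the index of centers2.
# Overlapping column names from centers2 get the '_centers' suffix.
result = data2.merge(
    centers2, left_on="Cluster", right_index=True, suffixes=("", "_centers")
).sort_index()  # restore the original row order of data2

print(result)
```

The new columns should match the manual attempt in the question (Centers.x and Centers.y), only under the names x_centers and y_centers.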

There is also the .join() option, which is another way to access .merge(), or pd.concat() if you are joining several DataFrames on their index - from the source:

def join(self, other, on=None, how='left', lsuffix='', rsuffix='',
         sort=False):
    return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
                             rsuffix=rsuffix, sort=sort)

def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
                 sort=False):
    from pandas.tools.merge import merge, concat

    if isinstance(other, Series):
        if other.name is None:
            raise ValueError('Other Series must have a name')
        other = DataFrame({other.name: other})

    if isinstance(other, DataFrame):
        return merge(self, other, left_on=on, how=how,
                     left_index=on is None, right_index=True,
                     suffixes=(lsuffix, rsuffix), sort=sort)
    else:
        if on is not None:
            raise ValueError('Joining multiple DataFrames only supported'
                             ' for joining on index')

Yes, this is the simplest, but the row order changes. – jezrael

Right, add '.sort_index()' to make sure you get the original order back. – Stefan

Would love to - where is it? – Stefan


You can use concat with map:

# Map the cluster label to each center coordinate, then glue the columns together.
print(pd.concat([data2.x, data2.y,
                 data2.Cluster,
                 data2.Cluster.map(centers2.x.to_dict()),
                 data2.Cluster.map(centers2.y.to_dict())],
                axis=1,
                keys=['x', 'y', 'Cluster', 'Centers.x', 'Centers.y']))

          x         y Cluster  Centers.x  Centers.y
0 -0.247322 -0.699005       A          6          5
1 -0.026692  0.551841       B          1          4
2 -1.730480 -0.170510       A          6          5
3  0.814357 -0.204729       B          1          4
4  2.387925 -0.503993       C          1          0
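
A slightly shorter variant of the same idea, as a sketch: Series.map also accepts a Series and looks values up in its index, so the to_dict() call is optional, and a dict comprehension covers every column of centers2. The fixed numbers are taken from the question; the name `mapped` is only for illustration.

```python
import pandas as pd

data2 = pd.DataFrame(
    {
        "x": [0.151212, -0.078935, -0.388903, 0.622089, -0.346856],
        "y": [-0.168855, 1.933378, 0.444610, 1.609730, 1.095834],
        "Cluster": ["A", "B", "A", "B", "C"],
    }
)
centers2 = pd.DataFrame({"x": [6, 6, 4], "y": [4, 0, 1]}, index=["A", "B", "C"])

# One Series.map call per column of centers2; each call is vectorized.
new_columns = {
    f"Centers.{col}": data2["Cluster"].map(centers2[col])
    for col in centers2.columns
}

# assign returns a new frame and leaves data2 untouched.
mapped = data2.assign(**new_columns)
print(mapped)
```

This keeps the original row order without any sorting, because map works row by row on the Cluster column.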

Solution with join (docs):

print(data2.join(centers2, on='Cluster', rsuffix='_centers'))

          x         y Cluster  x_centers  y_centers
0 -0.247322 -0.699005       A          6          5
1 -0.026692  0.551841       B          1          4
2 -1.730480 -0.170510       A          6          5
3  0.814357 -0.204729       B          1          4
4  2.387925 -0.503993       C          1          0
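
To check that join with on='Cluster' behaves like a left merge against the index of centers2, a quick sketch with the question's fixed values could look like this. The names `joined` and `merged` are illustrative.

```python
import pandas as pd

data2 = pd.DataFrame(
    {
        "x": [0.151212, -0.078935, -0.388903, 0.622089, -0.346856],
        "y": [-0.168855, 1.933378, 0.444610, 1.609730, 1.095834],
        "Cluster": ["A", "B", "A", "B", "C"],
    }
)
centers2 = pd.DataFrame({"x": [6, 6, 4], "y": [4, 0, 1]}, index=["A", "B", "C"])

# join defaults to how='left' and always matches against the other frame's index.
joined = data2.join(centers2, on="Cluster", rsuffix="_centers")

# The equivalent merge has to spell those choices out.
merged = data2.merge(
    centers2,
    left_on="Cluster",
    right_index=True,
    how="left",
    suffixes=("", "_centers"),
)

# Raises AssertionError if the two frames differ in values, index, or dtypes.
pd.testing.assert_frame_equal(joined, merged)
print(joined)
```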

Another solution with merge; it is the same as join, but with two added parameters:

print(data2.merge(centers2,
                  left_on='Cluster',
                  right_index=True,
                  suffixes=['', '_centers'],
                  sort=False,
                  how='left'))
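
The how='left' argument also matters when a cluster label has no matching center. The sketch below uses a made-up label 'D' (not in the original data) to show the difference from the default inner merge.

```python
import pandas as pd

# 'D' is a hypothetical label with no row in centers2.
data2 = pd.DataFrame({"x": [0.1, 0.2, 0.3], "Cluster": ["A", "D", "B"]})
centers2 = pd.DataFrame({"x": [6, 4]}, index=["A", "B"])

# how='left' keeps every row of data2; unmatched rows get NaN.
left = data2.merge(
    centers2,
    left_on="Cluster",
    right_index=True,
    suffixes=("", "_centers"),
    sort=False,
    how="left",
)

# The default how='inner' silently drops the unmatched row.
inner = data2.merge(
    centers2, left_on="Cluster", right_index=True, suffixes=("", "_centers")
)

print(left)
print(inner)
```

Note that the NaN forces x_centers to a float column in the left merge, which is worth remembering if the centers are expected to stay integers.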

Timings

len(df)=5k

data2 = pd.concat([data2]*1000).reset_index(drop=True) 

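def root(data2, centers2):
    # Row-by-row lookup through apply, used as the slow baseline.
    # The timing below was taken with centers2.get_value(...), which was
    # removed in pandas 1.0; .at is its replacement.
    data2['Centers.x'] = data2.apply(lambda row: centers2.at[row['Cluster'], 'x'], axis=1)
    data2['Centers.y'] = data2.apply(lambda row: centers2.at[row['Cluster'], 'y'], axis=1)
    return data2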

In [117]: %timeit root(data2, centers2) 
1 loops, best of 3: 267 ms per loop 

In [118]: %timeit data2.merge(centers2, left_on='Cluster', right_index=True, suffixes=['', '_centers'], sort=False, how='left') 
1000 loops, best of 3: 1.71 ms per loop 

In [119]: %timeit data2.join(centers2, on='Cluster', rsuffix ='_centers', sort=False, how='left') 
1000 loops, best of 3: 1.71 ms per loop 

In [120]: %timeit pd.concat([data2.x, data2.y, data2.Cluster, data2.Cluster.map(centers2.x.to_dict()), data2.Cluster.map(centers2.y.to_dict())], axis=1, keys=['x','y','Cluster','Centers.x','Centers.y']) 
100 loops, best of 3: 2.15 ms per loop 

In [121]: %timeit data2.merge(centers2, left_on='Cluster', right_index=True, suffixes=['', '_centers']).sort_index() 
100 loops, best of 3: 2.68 ms per loop 
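
Outside IPython, a comparison like the one above could be reproduced with the standard timeit module. This is only a rough sketch: the data is regenerated randomly at 5k rows, the helper names (`via_merge`, `via_join`, `via_map`) are made up here, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
data2 = pd.DataFrame(
    {
        "x": rng.standard_normal(n),
        "y": rng.standard_normal(n),
        "Cluster": rng.choice(["A", "B", "C"], size=n),
    }
)
centers2 = pd.DataFrame({"x": [6, 6, 4], "y": [4, 0, 1]}, index=["A", "B", "C"])


def via_merge():
    return data2.merge(centers2, left_on="Cluster", right_index=True,
                       suffixes=("", "_centers"), sort=False, how="left")


def via_join():
    return data2.join(centers2, on="Cluster", rsuffix="_centers")


def via_map():
    return data2.assign(
        x_centers=data2["Cluster"].map(centers2["x"]),
        y_centers=data2["Cluster"].map(centers2["y"]),
    )


candidates = {"merge": via_merge, "join": via_join, "map": via_map}

# Best of 3 repeats, 10 calls each, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=10, repeat=3)) / 10
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```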

Added timings for Stefan Jansen's solution. – jezrael