2016-08-22 80 views
3

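I have many columns in my dataset and need to change the values of some variables. I do it with a dictionary and a loop over the pd.DataFrame, as follows.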

import pandas as pd 
import numpy as np 
df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5}) 

Selection

df1 = df[['one', 'two']] 

Dictionary

map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'} 

And the loop

df2=[] 
for i in df1.values: 
    np = [ map[x] for x in i] 
    df2.append(np) 

Then I change the columns

df['one'] = [row[0] for row in df2] 
df['two'] = [row[1] for row in df2] 

It works, but it is a very long way of doing it. How can I shorten it?

+1

['df.replace'](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html)? – DeepSpace

Answers

2

You can loop over the columns and use Series.map():

cols = ['one', 'two'] 
mapd = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'} 

for col in cols: 
    df[col] = df[col].map(mapd).fillna(df[col]) 


df 
Out: 
    one three two 
0 d  a b 
1 c  d a 
2 d  a b 
3 c  d a 
4 d  a b 
5 c  d a 
6 d  a b 
7 c  d a 
8 d  a b 
9 c  d a 
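A minimal sketch of why the `.fillna(df[col])` step is there: `Series.map()` with a dict returns NaN for any value that has no key in the dict, and `fillna` with the original column restores those values. The series `s` and the incomplete mapping `partial` are hypothetical and only serve the illustration; they are not part of the question's data.

```python
import pandas as pd

# Hypothetical data: 'x' has no entry in the mapping below.
s = pd.Series(['a', 'b', 'x'])
partial = {'a': 'd', 'b': 'c'}

# map() looks up each value once; values without a key become NaN.
mapped = s.map(partial)

# fillna() with the original series puts the unmapped values back.
restored = mapped.fillna(s)

print(mapped.isna().tolist())
print(restored.tolist())
```

With the question's full dictionary every value in columns 'one' and 'two' has a key, so the `fillna` there is only a safeguard.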

Timings:

df = pd.DataFrame({'one':['a' , 'b']*5000000, 
        'two':['c' , 'd']*5000000, 
        'three':['a' , 'd']*5000000}) 

%%timeit 
for col in cols: 
    df[col].map(mapd).fillna(df[col]) 
1 loop, best of 3: 1.71 s per loop 

%%timeit 
for col in cols: 
    colSet = set(df[col].values) 
    colMap = {k:v for k,v in mapd.items() if k in colSet} 
    df.replace(to_replace={col:colMap}) 
1 loop, best of 3: 3.35 s per loop 


%timeit df[cols].stack().map(mapd).unstack() 
1 loop, best of 3: 9.18 s per loop 
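The `%%timeit` cells above are IPython magics. Outside IPython, a comparable measurement could be sketched with the standard `timeit` module. The frame size `n` and the helper names `with_map` and `with_replace` are assumptions chosen for the illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

n = 50_000  # assumed size, much smaller than the 5,000,000 used above
df = pd.DataFrame({'one': ['a', 'b'] * n,
                   'two': ['c', 'd'] * n,
                   'three': ['a', 'd'] * n})
cols = ['one', 'two']
mapd = {'a': 'd', 'b': 'c', 'c': 'b', 'd': 'a'}


def with_map():
    # Per-column Series.map(), as in the first timing above.
    for col in cols:
        df[col].map(mapd).fillna(df[col])


def with_replace():
    # Per-column DataFrame.replace() with a filtered mapping,
    # as in the second timing above.
    for col in cols:
        col_set = set(df[col].values)
        col_map = {k: v for k, v in mapd.items() if k in col_set}
        df.replace(to_replace={col: col_map})


# Best of 3 runs, one call per run, mirroring "best of 3".
t_map = min(timeit.repeat(with_map, number=1, repeat=3))
t_replace = min(timeit.repeat(with_replace, number=1, repeat=3))
print('map:     %.4f s' % t_map)
print('replace: %.4f s' % t_replace)
```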
2

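Passing the whole map to a column that contains only the values 'a' and 'b' is not efficient. First check which values occur in the df column, then map only the entries that apply to it, as follows: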

>>> cols = ['one', 'two']; 
>>> map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}; 

>>> for col in cols: 
...     colSet = set(df[col].values)  # values actually present in this column
...     colMap = {k:v for k,v in map.items() if k in colSet}  # keep only the matching keys
...     df.replace(to_replace={col:colMap},inplace=True)  # not really efficient
... 
>>> df 
    one three two 
0 d  a b 
1 c  d a 
2 d  a b 
3 c  d a 
4 d  a b 
5 c  d a 
6 d  a b 
7 c  d a 
8 d  a b 
9 c  d a 
>>> 
# OR 
In [12]: %%timeit 
    ...: for col in cols: 
    ...:     colSet = set(df[col].values) 
    ...:     colMap = {k:v for k,v in map.items() if k in colSet} 
    ...:     df[col].map(colMap) 
    ...: 
    ...: 
1 loop, best of 3: 1.93 s per loop 
# OR, WHEN ASSIGNING BACK IN PLACE 
In [8]: %%timeit 
    ...: for col in cols: 
    ...:     colSet = set(df[col].values) 
    ...:     colMap = {k:v for k,v in map.items() if k in colSet} 
    ...:     df[col]=df[col].map(colMap) 
    ...: 
    ...: 
1 loop, best of 3: 2.18 s per loop 

This is possible too:

df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5}) 
map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'} 
cols = ['one','two'] 

def func(s): 
    # apply() passes each column as a Series; only remap the selected ones
    if s.name in cols: 
        s = s.map(map) 
    return s 

print(df.apply(func)) 

Also watch out for overlapping keys (that is, when you want to change, say, a to b and b to c in parallel, rather than as a chain a -> b -> c)...

>>> cols = ['one', 'two']; 
>>> map = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'}; 
>>> mapCols = {k:map for k in cols}; 
>>> df.replace(to_replace=mapCols,inplace=True); 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "Q:\Miniconda3\envs\py27a\lib\site-packages\pandas\core\generic.py", line 3352, in replace 
    raise ValueError("Replacement not allowed with " 
ValueError: Replacement not allowed with overlapping keys and values 
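A small sketch of the parallel behaviour described above, using `Series.map()` instead of `replace`. Because `map` looks up each original value exactly once, a mapping whose keys and values overlap does not chain. The series `s` and the dictionary `chain` are hypothetical examples, not taken from the thread.

```python
import pandas as pd

# Hypothetical overlapping mapping: 'b' is both a value and a key.
s = pd.Series(['a', 'b', 'c'])
chain = {'a': 'b', 'b': 'c'}

# Each element is looked up once, so 'a' becomes 'b' and stays 'b';
# it is not remapped a second time to 'c'.
result = s.map(chain).fillna(s)

print(result.tolist())  # ['b', 'c', 'c']
```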
+0

This is about twice as slow (less efficient). – ayhan

+0

...just a guess (it is only a guess, but the reasoning should not be wrong; only my implementation may not be that fast ;/). Is it df.replace that is inefficient? –

+0

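replace is generally slower (even when you apply it to the whole DataFrame at once and compare that against looping over the columns with map), because map is more specific and limited. I don't think the difference comes from your implementation. What makes you think the actual implementation of `Series.map()` wastes time going through keys that do not exist? – ayhan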

1
df = pd.DataFrame({'one':['a' , 'b']*5, 'two':['c' , 'd']*5, 'three':['a' , 'd']*5}) 
m = { 'a' : 'd', 'b' : 'c', 'c' : 'b', 'd' : 'a'} 

cols = ['one', 'two'] 
df[cols] = df[cols].stack().map(m).unstack() 
df 
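To make the one-liner above easier to follow, here is a sketch that splits it into its steps: `stack()` turns the selected columns into one long Series indexed by (row, column), `map()` translates that Series in a single call, and `unstack()` restores the wide shape. The two-row frame and the intermediate names `stacked` and `result` are assumptions used only for illustration.

```python
import pandas as pd

# Shortened version of the example frame (two rows instead of ten).
df = pd.DataFrame({'one': ['a', 'b'], 'two': ['c', 'd'], 'three': ['a', 'd']})
m = {'a': 'd', 'b': 'c', 'c': 'b', 'd': 'a'}
cols = ['one', 'two']

# Step 1: long format, one value per (row label, column name) pair.
stacked = df[cols].stack()
print(stacked.index.nlevels)

# Step 2 and 3: translate the values, then go back to wide format.
result = stacked.map(m).unstack()

# Assign back; column 'three' is left untouched.
df[cols] = result
print(df)
```

According to the timings earlier in the thread, this form was the slowest of the three variants measured, so it trades speed for brevity.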
