2017-05-07 334 views
3

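pandas: merging dataframes conditionally on multiple columns

I have 2 dataframes, and I want to take one column from the first and use it to create a new column in the second, based on the values in multiple (other) columns.

The first dataframe (df1):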

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'cond': np.repeat([1,2], 5),
        'point': np.tile(np.arange(1,6), 2),
        'value1': np.random.rand(10),
        'unused1': np.random.rand(10)})

    cond point unused1 value1 
0  1  1 0.923699 0.103046 
1  1  2 0.046528 0.188408 
2  1  3 0.677052 0.481349 
3  1  4 0.464000 0.807454 
4  1  5 0.180575 0.962032 
5  2  1 0.941624 0.437961 
6  2  2 0.489738 0.026166 
7  2  3 0.739453 0.109630 
8  2  4 0.338997 0.415101 
9  2  5 0.310235 0.660748 

And the second (df2):

df2 = pd.DataFrame({'cond': np.repeat([1,2], 10), 
        'point': np.tile(np.arange(1,6), 4), 
        'value2': np.random.rand(20)}) 

    cond point value2 
0  1  1 0.990252 
1  1  2 0.534813 
2  1  3 0.407325 
3  1  4 0.969288 
4  1  5 0.085832 
5  1  1 0.922026 
6  1  2 0.567615 
7  1  3 0.174402 
8  1  4 0.469556 
9  1  5 0.511182 
10  2  1 0.219902 
11  2  2 0.761498 
12  2  3 0.406981 
13  2  4 0.551322 
14  2  5 0.727761 
15  2  1 0.075048 
16  2  2 0.159903 
17  2  3 0.726013 
18  2  4 0.848213 
19  2  5 0.284404 

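df1['value1'] contains one value for each combination of cond and point.

I want to create a new column (new_column) in df2 that contains the values from df1['value1'], where each value is the one whose cond and point match across the 2 dataframes.

So my desired output is this: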

cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748 

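In this example I could just use tile/repeat, but in reality df1['value1'] doesn't fit so neatly into the other dataframe. So I need to do it based on matching cond and point.

I have tried merging them, but 1) the numbers don't seem to match and 2) I don't want to bring over any of the unused columns from df1: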

df1.merge(df2, left_on=['cond', 'point'], right_on=['cond', 'point'])

What is the correct way to add this new column without having to iterate through the 2 dataframes?

Answers

2

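Option 1
For elegance and speed with pure pandas, we can use lookup.
This produces the same output as all the other options, shown below.

The concept is to represent the lookup data as a 2-D array and index into it with the lookup values.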

d1 = df1.set_index(['cond', 'point']).value1.unstack() 
df2.assign(new_column=d1.lookup(df2.cond, df2.point)) 
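
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so the snippet above only runs on older versions. Below is a minimal sketch of the same 2-D lookup for newer pandas, assuming every (cond, point) pair in df2 also exists in df1. The seeded generator and the names rows, cols and result are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Same 2-D layout as in Option 1: rows are cond, columns are point.
d1 = df1.set_index(['cond', 'point'])['value1'].unstack()

# Translate the labels in df2 into integer positions within d1.
rows = d1.index.get_indexer(df2['cond'].to_numpy())
cols = d1.columns.get_indexer(df2['point'].to_numpy())

# get_indexer returns -1 for a missing label, which NumPy would silently
# read as "last row/column", so fail loudly instead.
if (rows < 0).any() or (cols < 0).any():
    raise KeyError('df2 contains a cond or point label that df1 lacks')

result = df2.assign(new_column=d1.to_numpy()[rows, cols])
```

The explicit check matters because a -1 position would otherwise return a plausible-looking but wrong value.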

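Option 2
We can do the same thing with numpy to improve performance, provided the values are laid out the same way they are in df1. This is very fast!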

a = df1.value1.values.reshape(2, -1) 
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1]) 
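
The reshape above relies on df1 being sorted by cond and then point, on every combination being present, and on both columns counting up from 1. The following is a hedged sketch that relaxes the ordering and the start-at-1 assumptions by sorting first and locating positions with np.searchsorted. The sample labels, the shuffled input and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df1 = pd.DataFrame({'cond': np.repeat([10, 20], 5),       # labels need not start at 1
                    'point': np.tile(np.arange(3, 8), 2),
                    'value1': rng.random(10)})
df1 = df1.sample(frac=1, random_state=0)                  # deliberately shuffled
df2 = pd.DataFrame({'cond': np.repeat([10, 20], 10),
                    'point': np.tile(np.arange(3, 8), 4),
                    'value2': rng.random(20)})

ordered = df1.sort_values(['cond', 'point'])
conds = np.unique(ordered['cond'].to_numpy())     # sorted unique labels
points = np.unique(ordered['point'].to_numpy())

# The reshape is only valid when df1 is a complete cond x point grid
# with no duplicated (cond, point) pair.
complete = (len(ordered) == len(conds) * len(points)
            and not ordered.duplicated(['cond', 'point']).any())
if not complete:
    raise ValueError('df1 is not a complete cond x point grid')
a = ordered['value1'].to_numpy().reshape(len(conds), len(points))

# Positions of df2's labels within the sorted unique labels
# (assumes every label in df2 also occurs in df1).
r = np.searchsorted(conds, df2['cond'].to_numpy())
c = np.searchsorted(points, df2['point'].to_numpy())
result = df2.assign(new_column=a[r, c])
```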

Option 3
The canonical answer is to use merge with the left parameter.
But we need to prep df1 a bit to nail the output.

d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'}) 
df2.merge(d1, 'left') 
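
When the real data is less tidy than the example, it may help to let merge verify the join. Here is a small sketch, using made-up frames, of the validate and indicator parameters of merge: validate='many_to_one' raises if the lookup side has duplicate keys, and indicator=True marks rows of df2 that found no partner.

```python
import pandas as pd

df1 = pd.DataFrame({'cond': [1, 1, 2, 2],
                    'point': [1, 2, 1, 2],
                    'value1': [0.1, 0.2, 0.3, 0.4],
                    'unused1': [9.0, 9.0, 9.0, 9.0]})
df2 = pd.DataFrame({'cond': [1, 2, 2, 3],
                    'point': [2, 1, 1, 1],       # (3, 1) has no partner in df1
                    'value2': [5.0, 6.0, 7.0, 8.0]})

d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})

merged = df2.merge(d1, on=['cond', 'point'], how='left',
                   validate='many_to_one',   # raises MergeError on duplicate keys in d1
                   indicator=True)           # adds a '_merge' column

# Rows of df2 without a match keep NaN in new_column and are tagged 'left_only'.
unmatched = merged[merged['_merge'] == 'left_only']
result = merged.drop(columns='_merge')
```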

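Option 4
I thought this would be fun. Build a mapping dictionary and a series to map.
Good for small data, not for large data. See the timings below.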

c1 = df1.cond.values.tolist() 
p1 = df1.point.values.tolist() 
v1 = df1.value1.values.tolist() 
m = {(c, p): v for c, p, v in zip(c1, p1, v1)} 

c2 = df2.cond.values.tolist() 
p2 = df2.point.values.tolist() 
i2 = df2.index.values.tolist() 
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)}) 

df2.assign(new_column=s2.map(m)) 
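
The same key-to-value idea can also be sketched without Python-level loops. This is a different technique from the dictionary above: it indexes df1 by (cond, point) and reindexes with the key pairs from df2. It assumes the pairs in df1 are unique; the seeded data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

# Series keyed by (cond, point); plays the role of the dictionary m.
s = df1.set_index(['cond', 'point'])['value1']

# One key per row of df2, in df2's row order.
keys = pd.MultiIndex.from_frame(df2[['cond', 'point']])

# reindex looks up every key; pairs missing from df1 would come back as NaN.
# to_numpy() drops the MultiIndex so the values align with df2 by position.
result = df2.assign(new_column=s.reindex(keys).to_numpy())
```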

OUTPUT

cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748 

Timings
Small data

%%timeit 
a = df1.value1.values.reshape(2, -1) 
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1]) 
1000 loops, best of 3: 304 µs per loop 

%%timeit 
d1 = df1.set_index(['cond', 'point']).value1.unstack() 
df2.assign(new_column=d1.lookup(df2.cond, df2.point)) 
100 loops, best of 3: 1.8 ms per loop 

%%timeit 
c1 = df1.cond.values.tolist() 
p1 = df1.point.values.tolist() 
v1 = df1.value1.values.tolist() 
m = {(c, p): v for c, p, v in zip(c1, p1, v1)} 
c2 = df2.cond.values.tolist() 
p2 = df2.point.values.tolist() 
i2 = df2.index.values.tolist() 
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)}) 
df2.assign(new_column=s2.map(m)) 
1000 loops, best of 3: 719 µs per loop 

%%timeit 
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'}) 
df2.merge(d1, 'left') 
100 loops, best of 3: 2.04 ms per loop 

%%timeit 
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left') 
df.rename(columns={'value1': 'new_column'}) 
100 loops, best of 3: 2.01 ms per loop 

%%timeit 
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point']) 
df.rename(columns={'value1': 'new_column'}) 
100 loops, best of 3: 2.15 ms per loop 

Large data

df2 = pd.concat([df2] * 10000, ignore_index=True) 

%%timeit 
a = df1.value1.values.reshape(2, -1) 
df2.assign(new_column=a[df2.cond.values - 1, df2.point.values - 1]) 
1000 loops, best of 3: 1.93 ms per loop 

%%timeit 
d1 = df1.set_index(['cond', 'point']).value1.unstack() 
df2.assign(new_column=d1.lookup(df2.cond, df2.point)) 
100 loops, best of 3: 5.58 ms per loop 

%%timeit 
c1 = df1.cond.values.tolist() 
p1 = df1.point.values.tolist() 
v1 = df1.value1.values.tolist() 
m = {(c, p): v for c, p, v in zip(c1, p1, v1)} 
c2 = df2.cond.values.tolist() 
p2 = df2.point.values.tolist() 
i2 = df2.index.values.tolist() 
s2 = pd.Series({i: (c, p) for i, c, p in zip(i2, c2, p2)}) 
df2.assign(new_column=s2.map(m)) 
10 loops, best of 3: 135 ms per loop 

%%timeit 
d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'}) 
df2.merge(d1, 'left') 
100 loops, best of 3: 13.4 ms per loop 

%%timeit 
df = pd.merge(df2, df1.drop('unused1', axis=1), 'left') 
df.rename(columns={'value1': 'new_column'}) 
10 loops, best of 3: 19.8 ms per loop 

%%timeit 
df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point']) 
df.rename(columns={'value1': 'new_column'}) 
100 loops, best of 3: 18.2 ms per loop 
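
The %%timeit cells above need IPython. Outside a notebook, a comparable measurement could be sketched with the standard timeit module, as below. The helper names with_numpy and with_merge, the reduced number of copies and the repeat counts are choices made for this sketch, and absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})
big = pd.concat([df2] * 1000, ignore_index=True)  # fewer copies than above, for speed


def with_numpy(frame):
    # Option 2: positional indexing into the reshaped values.
    a = df1['value1'].to_numpy().reshape(2, -1)
    return frame.assign(new_column=a[frame['cond'].to_numpy() - 1,
                                     frame['point'].to_numpy() - 1])


def with_merge(frame):
    # Option 3: left merge on the shared cond/point columns.
    d1 = df1[['cond', 'point', 'value1']].rename(columns={'value1': 'new_column'})
    return frame.merge(d1, how='left')


timings = {}
for name, func in [('numpy', with_numpy), ('merge', with_merge)]:
    # Best of 3 repeats of 10 calls each, stored as seconds per call.
    best = min(timeit.repeat(lambda: func(big), number=10, repeat=3))
    timings[name] = best / 10

for name, seconds in timings.items():
    print(f'{name}: {seconds * 1e3:.3f} ms per call')
```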

Thanks @jezrael. You too. – piRSquared

2

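You can use merge with a left join, drop to remove the unused1 column, and finally rename the column:

Note: the on parameter can be omitted if the only columns the two DataFrames have in common are the join columns. If more column names are shared, add on=['cond', 'point'].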

df = pd.merge(df2, df1.drop('unused1', axis=1), 'left') 
df = df.rename(columns={'value1': 'new_column'}) 
print (df) 
    cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748 
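
The note about the on parameter can be illustrated with two small made-up frames that share a third column name. Without on, merge uses every common column as a key; with on, the clashing column is kept from both sides with suffixes.

```python
import pandas as pd

left = pd.DataFrame({'cond': [1, 1], 'point': [1, 2], 'value': [10.0, 20.0]})
right = pd.DataFrame({'cond': [1, 1], 'point': [1, 2], 'value': [0.5, 0.7]})

# Without `on`, merge joins on every shared column name: cond, point AND value.
# Nothing matches on value, and no new column is added.
implicit = pd.merge(left, right, how='left')

# With `on`, only cond and point are keys; the clashing 'value' columns
# come through as value_x (left) and value_y (right).
explicit = pd.merge(left, right, how='left', on=['cond', 'point'])
```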

Another solution with join (a left join by default) and set_index + drop:

df = df2.join(df1.drop('unused1', axis=1).set_index(['cond', 'point']), on=['cond', 'point']) 
df = df.rename(columns={'value1': 'new_column'}) 
print (df) 
    cond point value2 new_column 
0  1  1 0.990252 0.103046 
1  1  2 0.534813 0.188408 
2  1  3 0.407325 0.481349 
3  1  4 0.969288 0.807454 
4  1  5 0.085832 0.962032 
5  1  1 0.922026 0.103046 
6  1  2 0.567615 0.188408 
7  1  3 0.174402 0.481349 
8  1  4 0.469556 0.807454 
9  1  5 0.511182 0.962032 
10  2  1 0.219902 0.437961 
11  2  2 0.761498 0.026166 
12  2  3 0.406981 0.109630 
13  2  4 0.551322 0.415101 
14  2  5 0.727761 0.660748 
15  2  1 0.075048 0.437961 
16  2  2 0.159903 0.026166 
17  2  3 0.726013 0.109630 
18  2  4 0.848213 0.415101 
19  2  5 0.284404 0.660748
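
As a quick sanity check, the merge and join solutions can be compared directly. This is a sketch on freshly generated data (seeded here for reproducibility, unlike the frames in the question); pd.testing.assert_frame_equal raises if the two results differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df1 = pd.DataFrame({'cond': np.repeat([1, 2], 5),
                    'point': np.tile(np.arange(1, 6), 2),
                    'value1': rng.random(10),
                    'unused1': rng.random(10)})
df2 = pd.DataFrame({'cond': np.repeat([1, 2], 10),
                    'point': np.tile(np.arange(1, 6), 4),
                    'value2': rng.random(20)})

lookup = df1.drop(columns='unused1')

# Left merge on the shared columns cond and point.
via_merge = pd.merge(df2, lookup, how='left').rename(columns={'value1': 'new_column'})

# join aligns df2's cond/point columns with the MultiIndex of the lookup frame.
via_join = (df2.join(lookup.set_index(['cond', 'point']), on=['cond', 'point'])
               .rename(columns={'value1': 'new_column'}))

# Raises AssertionError if the two approaches disagree.
pd.testing.assert_frame_equal(via_merge, via_join)
```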