2015-01-31 42 views

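Pandas join: join column not recognized

I have no idea what is going on here, so the title is only a first-order approximation. I am trying to join two DataFrames: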

>>> df_sum.head() 
     TUCASEID t070101 t070102 t070103 t070104 t070105 t070199 \ 
0 20030100013280  0  0  0  0  0  0 
1 20030100013344  0  0  0  0  0  0 
2 20030100013352  60  0  0  0  0  0 
3 20030100013848  0  0  0  0  0  0 
4 20030100014165  0  0  0  0  0  0 

    t070201 t070299 shopping year 
0  0  0   0 2003 
1  0  0   0 2003 
2  0  0  60 2003 
3  0  0   0 2003 
4  0  0   0 2003 
>>> emp.head() 
     TUCASEID status 
0 20030100013280 emp 
1 20030100013344 emp 
2 20030100013352 emp 
4 20030100014165 emp 
5 20030100014169 emp 

Those are the DataFrames. I want to join them on the common column TUCASEID, whose values overlap:

>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID) 
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462, 
     20131212132469, 20131212132475]) 

Now...

>>> df_sum.join(emp, on='TUCASEID', how='inner') 
Traceback (most recent call last): 
    File "<input>", line 1, in <module> 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join 
    rsuffix=rsuffix, sort=sort) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat 
    suffixes=(lsuffix, rsuffix), sort=sort) 
    File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge 
    return op.get_result() 
    File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result 
    rdata.items, rsuf) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix 
    to_rename) 
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object') 

Hmm, that is odd: the only column that appears in both DataFrames is the one being joined on. But fine, let's go along with it [1]:

>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r') 
Empty DataFrame 
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status] 
Index: [] 

The result is empty despite the huge intersection. What is going on here?

>>> pd.__version__ 
'0.15.0' 

[1]: I actually forced the dtype of the join columns to integer, since the error message says 'object' there, and it made no difference:

>>> emp.dtypes 
TUCASEID  int64 
status  object 
dtype: object 
>>> df_sum.dtypes 
TUCASEID int64 
(...) 
shopping int64 
year  int64 
dtype: object 

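Your index values don't match, which is why the result is empty when joined this way. Also, why not just merge them with `df_sum.merge(emp, on='TUCASEID', how='outer')`? Or are you only interested in adding the `status` column to each `TUCASEID` row? In that case do `df_sum['status'] = df_sum['TUCASEID'].map(emp.set_index('TUCASEID')['status'])` – EdChum 2015-01-31 22:24:13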


@EdChum OK, I'd like to see the alternatives. The index values don't match? I specified an alternative column with `on=`. – FooBar 2015-01-31 22:25:39


Not sure. `join` joins on the index; strangely, I can reproduce the behaviour, but the other methods I suggested should be used instead – EdChum 2015-01-31 22:27:04

Answer


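df.join generally calls pd.merge (except in a special case, when it calls concat). Therefore anything join can do, merge can do as well. Although it may not be strictly correct, I tend to use df.join only when joining on indices, and pd.merge when joining on columns.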
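As a minimal sketch of that rule of thumb (the frames `left` and `right` and their columns are made up for illustration and are not the question's data), the same inner join can be written either way:

```python
import pandas as pd

# Hypothetical toy frames sharing a "key" column.
left = pd.DataFrame({"key": [1, 2, 3], "a": [10, 20, 30]})
right = pd.DataFrame({"key": [2, 3, 4], "b": ["x", "y", "z"]})

# Column-to-column: merge on the shared column.
by_column = pd.merge(left, right, on="key", how="inner")

# Index-to-index: move the key into the index on both sides, then join.
by_index = left.set_index("key").join(right.set_index("key"), how="inner")

print(by_column)
print(by_index)
```

Both results contain the rows for keys 2 and 3; the difference is only whether the key ends up as a column or as the index.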

That said, I can reproduce the problem you describe:

import numpy as np 
import pandas as pd 

df_sum = pd.DataFrame(np.arange(6*2).reshape((6,2)), 
         index=list('ABCDEF'), columns=list('XY')) 
emp = pd.DataFrame(np.arange(6*2).reshape((6,2)), 
        index=list('ABCDEF'), columns=list('XZ')) 
print(df_sum.join(emp, on='X', rsuffix='_r', how='inner')) 

# Empty DataFrame 
# Columns: [X, Y, X_r, Z] 
# Index: [] 

pd.merge works as expected, without any need to supply rsuffix:

print(pd.merge(df_sum, emp, on='X'))

yields

    X   Y   Z
0   0   1   1
1   2   3   3
2   4   5   5
3   6   7   7
4   8   9   9
5  10  11  11

Under the hood, df_sum.join calls merge like this:

if isinstance(other, DataFrame): 
     return merge(self, other, left_on=on, how=how, 
        left_index=on is None, right_index=True, 
        suffixes=(lsuffix, rsuffix), sort=sort) 

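So even though you call df_sum.join(emp, on='...'), under the hood pandas converts this to pd.merge(df_sum, emp, left_on='...', right_index=True). The values of the left column are therefore matched against emp's index, not against emp's column of the same name: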

In [228]: pd.merge(df_sum, emp, left_on='X', left_index=False, right_index=True) 
Out[228]: 
Empty DataFrame 
Columns: [X, X_x, Y, X_y, Z] 
Index: [] 
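The same effect can be sketched with small frames shaped like the ones in the question. The IDs and values below are invented for illustration, and both frames keep the default integer index. Since `join(on=...)` looks up the left column in the other frame's index, moving the key into `emp`'s index makes the lookup succeed:

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames (made-up values).
df_sum = pd.DataFrame({"TUCASEID": [101, 102, 103], "shopping": [0, 60, 0]})
emp = pd.DataFrame({"TUCASEID": [101, 103, 104], "status": ["emp", "emp", "emp"]})

# df_sum['TUCASEID'] is looked up in emp.index (0, 1, 2): nothing matches.
no_match = df_sum.join(emp, on="TUCASEID", how="inner", rsuffix="_r")

# With TUCASEID as emp's index, the same call lines up on the IDs.
matched = df_sum.join(emp.set_index("TUCASEID"), on="TUCASEID", how="inner")

print(no_match.empty)
print(matched)
```

This also explains why the first attempt complained about overlapping columns: `emp` still carried its own TUCASEID column, which was treated as ordinary data rather than as the join key.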

For the merge to succeed, left_on='X' would have to be on='X':

In [233]: pd.merge(df_sum, emp, on='X', left_index=False, right_index=True) 
Out[233]: 
    X   Y   Z
A   0   1   1
B   2   3   3
C   4   5   5
D   6   7   7
E   8   9   9
F  10  11  11
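Applied to frames shaped like the ones in the question, a column-to-column merge could look like the sketch below. The data is invented for illustration. The second variant follows the `map` suggestion from the comments and assumes that TUCASEID is unique in `emp`:

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames (made-up values).
df_sum = pd.DataFrame({"TUCASEID": [101, 102, 103], "shopping": [0, 60, 0]})
emp = pd.DataFrame({"TUCASEID": [101, 103, 104], "status": ["emp", "emp", "emp"]})

# Inner merge on the shared column: only IDs present in both frames survive.
merged = pd.merge(df_sum, emp, on="TUCASEID", how="inner")

# Alternative: keep every row of df_sum and look up status by ID.
# IDs missing from emp get NaN. Assumes TUCASEID is unique in emp.
with_status = df_sum.copy()
with_status["status"] = with_status["TUCASEID"].map(
    emp.set_index("TUCASEID")["status"]
)

print(merged)
print(with_status)
```

The merge drops rows without a match, while the `map` variant keeps the shape of `df_sum`, so the choice depends on whether unmatched rows should remain in the result.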