2014-09-22 106 views
2

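Combining Series in pandas

I need to combine multiple pandas Series that contain string values. The Series are messages produced by several validation steps. I am trying to merge these messages into one Series so that I can attach it to the DataFrame. The problem is that the result comes out empty.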

Here is an example:

import pandas as pd 

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']}) 

index1 = df[df['a'] == 'b'].index 
index2 = df[df['a'] == 'a'].index 

series = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1) 
series += df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1) 

print(series)
# >>> series 
# 0 NaN 
# 1 NaN 
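The NaN values come from index alignment: `+=` matches the two Series label by label, and here label 1 exists only on one side and label 0 only on the other, so no label has two strings to join. Below is a minimal sketch with literal stand-in Series (`left` and `right` are hypothetical names, not from the example). It assumes the labels never overlap, in which case `combine_first` is enough.

```python
import pandas as pd

# Hypothetical stand-ins for the two apply() results above.
left = pd.Series(['bb-bbb'], index=[1])
right = pd.Series(['a-aaa'], index=[0])

# Arithmetic aligns on the index first; a label missing on either side gives NaN.
aligned_sum = left + right
print(aligned_sum.isna().all())

# With disjoint labels, combine_first takes each value from whichever side has it.
combined = left.combine_first(right)
print(combined)
```

If the same label can appear in more than one Series, `combine_first` keeps only the value from the calling Series, so overlapping messages need a different approach.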

Update

import pandas as pd 

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']}) 

index1 = df[df['a'] == 'b'].index 
index2 = df[df['a'] == 'a'].index 

series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1) 
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1) 
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1) 

# series3 causes a ValueError: cannot reindex from a duplicate axis 
series = pd.concat([series1, series2, series3]) 
df['series'] = series 
print(df)
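To see where the error comes from, it can help to inspect the concatenated index before assigning it. The sketch below uses literal stand-ins for `series1`, `series2` and `series3`; the variable `error_message` is introduced here purely for illustration. `pd.concat` keeps the original labels, so label 0 ends up twice, and the column assignment then has no single value to place in row 0.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']})

# Literal stand-ins for series1, series2 and series3 above.
series1 = pd.Series(['bb-bbb'], index=[1])
series2 = pd.Series(['a-aaa'], index=[0])
series3 = pd.Series(['a-ccc'], index=[0])

series = pd.concat([series1, series2, series3])
print(series.index.tolist())    # labels are kept as-is, so 0 appears twice
print(series.index.is_unique)

error_message = None
try:
    # Assignment has to align `series` with df.index, which fails on duplicates.
    df['series'] = series
except ValueError as exc:
    error_message = str(exc)
    print('assignment failed:', error_message)
```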

Update 2

In this example, the indices seem to get mixed up.

import pandas as pd 

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']}) 

index1 = df[df['a'] == 'a'].index 
index2 = df[df['a'] == 'b'].index 
index3 = df[df['a'] == 'c'].index 

series1 = df.iloc[index1].apply(lambda x: x['a'] + '-aaa', axis=1) 
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-bbb', axis=1) 
series3 = df.iloc[index3].apply(lambda x: x['a'] + '-ccc', axis=1) 

print(series1)
print()
print(series2)
print()
print(series3)
print()

df['series'] = pd.concat([series1, series2, series3], ignore_index=True)
print(df)
print()

df['series'] = pd.concat([series2, series1, series3], ignore_index=True)
print(df)
print()

df['series'] = pd.concat([series3, series2, series1], ignore_index=True)
print(df)
print()

This produces the following output:

0 a-aaa 
dtype: object 

1 b-bbb 
dtype: object 

2 c-ccc 
dtype: object 

    a b series 
0 a aa a-aaa 
1 b bb b-bbb 
2 c cc c-ccc 
3 d dd NaN 

    a b series 
0 a aa b-bbb 
1 b bb a-aaa 
2 c cc c-ccc 
3 d dd NaN 

    a b series 
0 a aa c-ccc 
1 b bb b-bbb 
2 c cc a-aaa 
3 d dd NaN 

I would expect only a's in row 0, only b's in row 1 and only c's in row 2, but that is not the case...
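The mix-up follows from what `ignore_index=True` does: it discards the original labels and renumbers the result 0, 1, 2, ... in concatenation order, so the row a message lands in depends only on its position in the list. A small sketch of the difference, using literal stand-ins for the three Series (the column names `renumbered` and `kept` are made up for this illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd']})

# Literal stand-ins for series1, series2 and series3 above.
s1 = pd.Series(['a-aaa'], index=[0])
s2 = pd.Series(['b-bbb'], index=[1])
s3 = pd.Series(['c-ccc'], index=[2])

# Labels are thrown away and replaced by 0, 1, 2 in list order.
df['renumbered'] = pd.concat([s3, s2, s1], ignore_index=True)

# Labels are kept, so assignment aligns each message with its own row.
df['kept'] = pd.concat([s3, s2, s1])

print(df)
```

Keeping the labels only works here because they are unique; with repeated labels the assignment fails.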

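Update 3

Here is a better example that should demonstrate the expected behaviour. As I said, the use case is that, for a given DataFrame, a function evaluates each row and may return error messages for some rows as a Series (some indices are included, some are not; if no errors are returned, the error Series is empty).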

In [12]: 

s1 = pd.Series(['b', 'd'], index=[1, 3]) 
s2 = pd.Series(['a', 'b'], index=[0, 1]) 
s3 = pd.Series(['c', 'e'], index=[2, 4]) 
s4 = pd.Series([], index=[]) 
pd.concat([s1, s2, s3, s4]).sort_index() 

# I'd like to get: 
# 
# 0 a 
# 1 b b 
# 2 c 
# 3 d 
# 4 e 
Out[12]: 
0 a 
1 b 
1 b 
2 c 
3 d 
4 e 
dtype: object 
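One possible way to get the desired output is to concatenate first and then collapse repeated labels by grouping on the index. This is a sketch rather than a definitive answer: the names `stacked` and `merged` are introduced here, and the empty Series is given an explicit `dtype=object` and integer index (an assumption, so that the concatenated result has a predictable dtype).

```python
import pandas as pd

s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
# Assumption: explicit dtypes for the empty "no errors" Series.
s4 = pd.Series([], index=pd.Index([], dtype='int64'), dtype=object)

# Stack everything, keeping the original labels (label 1 appears twice).
stacked = pd.concat([s1, s2, s3, s4])

# Group rows that share a label and join their messages with a space.
merged = stacked.groupby(level=0).agg(' '.join)
print(merged)
```

Because `merged` has one entry per label, it can be assigned to a DataFrame column without the duplicate-label error.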

Answers

0

I may have found a solution. I hope someone can comment on it...

s1 = pd.Series(['b', 'd'], index=[1, 3]) 
s2 = pd.Series(['a', 'b'], index=[0, 1]) 
s3 = pd.Series(['c', 'e'], index=[2, 4]) 
s4 = pd.Series([], index=[]) 
pd.concat([s1, s2, s3, s4]).sort_index() 


df1 = pd.DataFrame(s1) 
df2 = pd.DataFrame(s2) 
df3 = pd.DataFrame(s3) 
df4 = pd.DataFrame(s4) 

d = pd.DataFrame({0:[]}) 
d = pd.merge(df1, d, how='outer', left_index=True, right_index=True) 
d = d.fillna('') 
d = pd.DataFrame(d['0_x'] + d['0_y']) 

d = pd.merge(df2, d, how='outer', left_index=True, right_index=True) 
d = d.fillna('') 
d = pd.DataFrame(d['0_x'] + d['0_y']) 

d = pd.merge(df3, d, how='outer', left_index=True, right_index=True) 
d = d.fillna('') 
d = pd.DataFrame(d['0_x'] + d['0_y']) 

d = pd.merge(df4, d, how='outer', left_index=True, right_index=True) 
d = d.fillna('') 
d = pd.DataFrame(d['0_x'] + d['0_y']) 
print(d)

which returns

0 
0 a 
1 bb 
2 c 
3 d 
4 e 
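The four merge steps above all follow the same pattern, so they could be folded into a loop. The sketch below is an alternative, not the code from this answer: it swaps the merge for `reindex` with `fill_value=''` plus string addition, and `append_messages` is a hypothetical helper name. It also assumes each individual Series has unique labels and gives the empty Series explicit dtypes.

```python
from functools import reduce

import pandas as pd


def append_messages(acc, new):
    """Join two message Series label by label, treating missing labels as ''."""
    labels = acc.index.union(new.index)
    return acc.reindex(labels, fill_value='') + new.reindex(labels, fill_value='')


s1 = pd.Series(['b', 'd'], index=[1, 3])
s2 = pd.Series(['a', 'b'], index=[0, 1])
s3 = pd.Series(['c', 'e'], index=[2, 4])
# Assumption: explicit dtypes for the empty Series.
s4 = pd.Series([], index=pd.Index([], dtype='int64'), dtype=object)

# Fold the list pairwise; messages sharing a label are concatenated.
result = reduce(append_messages, [s1, s2, s3, s4])
print(result)
```

Unlike the merge version, which puts the newly merged message first, this appends each new message after the accumulated one; for the data here both orders give the same strings.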
0

How about concat?

s1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1) 
s2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1) 


s = pd.concat([s1,s2]) 
print(s)

1 bb-bbb 
0 a-aaa 
dtype: object 
+0

Sorry, I have to take that back. I'm getting a ValueError (see the updated example). – orange 2014-09-22 12:24:33

2

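By default, concatenation keeps the existing indices. If they collide, assigning the result to the DataFrame raises the ValueError you found, so you need to set ignore_index=True: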

In [33]: 

series = pd.concat([series1, series2, series3], ignore_index=True) 
df['series'] = series 
print (df) 
    a b series 
0 a aa bb-bbb 
1 b bb a-aaa 
2 c cc a-ccc 
3 d dd  NaN 

Edit

I think I now understand what you want. You can achieve it by converting the Series to a DataFrame and then merging on the index:

In [96]: 

df = pd.DataFrame({'a': ['a', 'b', 'c', 'd'], 'b': ['aa', 'bb', 'cc', 'dd']}) 

index1 = df[df['a'] == 'b'].index 
index2 = df[df['a'] == 'a'].index 

series1 = df.iloc[index1].apply(lambda x: x['b'] + '-bbb', axis=1) 
series2 = df.iloc[index2].apply(lambda x: x['a'] + '-aaa', axis=1) 
series3 = df.iloc[index2].apply(lambda x: x['a'] + '-ccc', axis=1) 
# we now don't ignore the index in order to preserve the identity of the row we want to merge back to later 
series = pd.concat([series1, series2, series3]) 
# construct a dataframe from the series and give the column a name 
df1 = pd.DataFrame({'series':series}) 
# perform an outer merge on both df's indices 
df.merge(df1, left_index=True, right_index=True, how='outer') 

Out[96]: 
    a b series 
0 a aa a-aaa 
0 a aa a-ccc 
1 b bb bb-bbb 
2 c cc  NaN 
3 d dd  NaN 
+0

I don't think this works. 'series2' should add '-aaa' at index 0 (df['a'] == 'a'), not at index 1. – orange 2014-09-23 13:25:21

+0

Can you update your post to show exactly what you want? That would help clarify things; at the moment it is hard to tell what you expect. – EdChum 2014-09-23 13:31:27

+0

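Basically, your problem is that you have a non-unique index that clashes with the df you are trying to assign values to, am I right? It looks like you want to repeat a row and have 2 rows, one with aaa and the other with ccc; can you confirm? – EdChum 2014-09-23 13:35:28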