Pandas - how to control column order when appending to a DataFrame

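I am having a devilishly hard time figuring out how to take a DataFrame with N rows, a Series with N rows, and another Series with N rows, and join them all together. Here is what I am doing (incorrectly):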

print df['Survived'].shape    # Series should be 1st column 
print pd.Series(kmeans.labels_).shape # Series should be 2nd column 
print pd.DataFrame(X_pca).shape   # DataFrame should be remaining columns 
new_df = pd.DataFrame() 
new_df['Survived'] = df['Survived'] 
new_df['ClusterId'] = pd.Series(kmeans.labels_) 
new_df = new_df.append(pd.DataFrame(X_pca)) 
print new_df.shape 
print new_df.columns.values

The output is:

(1309,) 
(1309,) 
(1309, 9) 
(2618, 11) 
[0L 1L 2L 3L 4L 5L 6L 7L 8L 'ClusterId' 'Survived']

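Two things I don't understand:

The column order is all wrong. I tried starting with the DataFrame, then appending the 'ClusterId' Series, then the 'Survived' Series, but the resulting DataFrame had exactly the same column order.
After appending the DataFrame with DataFrame.append, the resulting DataFrame has twice as many rows.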

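I tried reading the documentation, but I am having a very hard time finding anything that covers exactly what I am trying to do (which does not seem like an unusual thing to do). I also tried pd.concat([Series, Series, DataFrame], axis=1), but that throws an error: pandas.core.index.InvalidIndexError: Reindexing only valid with uniquely valued Index objects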

Source

2014-09-05 Dave Novelli

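It would help if you gave some idea of what the actual Series and DataFrames look like. From the error on your 'pd.concat' attempt, I suspect that the index of one or more of the Series/DataFrames does not match the others. The index matters! Also, 'append' naturally appends *rows*, not columns, FYI. – Ajean 2014-09-05 22:47:43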

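Let's pretend I have no concept of an index and just think of it all as raw data. I start with a dataset in a DataFrame. I transform a lot of columns into fewer columns and use the new columns to compute a cluster ID. So I have one column I want to keep from the original DataFrame, the new cluster ID column I want to create, and the new reduced set of transformed columns. They are all lined up in the correct order (I guess this is where the index comes in?). I don't really care about the index; they are all in the right order and I just want to merge them. – 2014-09-05 22:52:39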

It looks as though using 'ignore_index=True' would be a solution (with concat()), but the error persists. – 2014-09-05 23:06:51
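
The doubled row count in the question follows from Ajean's point that 'append' stacks rows. Below is a minimal sketch with made-up data: the names `left` and `right` and the row count are assumptions for illustration only. Since `DataFrame.append` was removed in pandas 2.0, `pd.concat` with its default `axis=0` stands in for it here.

```python
import numpy as np
import pandas as pd

n = 5  # illustrative stand-in for the 1309 rows in the question

# Two-column frame, like new_df after adding 'Survived' and 'ClusterId'
left = pd.DataFrame({"Survived": np.zeros(n), "ClusterId": np.ones(n)})
# Nine-column frame, like pd.DataFrame(X_pca)
right = pd.DataFrame(np.random.randn(n, 9))

# Row-wise concatenation (what append did): the rows are stacked and the
# result carries the union of the columns, with NaN where a column is absent.
stacked = pd.concat([left, right])  # axis=0 is the default
print(stacked.shape)  # (10, 11)

# Column-wise concatenation: rows are matched on the index instead.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)  # (5, 11)
```

With n = 1309 the first call gives the (2618, 11) shape shown in the question, while the second keeps the row count unchanged.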

Debugging pandas without test data is extremely hard, but here is a working example that I think is close to your steps.

import pandas as pd 
import numpy as np 

df = pd.DataFrame(dict(a=np.random.randn(5), b=np.random.randn(5), 
         c=np.random.randn(5))) 
s1 = df['b']*2 
s1.name = 's1' 
s2 = df['b']/4 
s2.name = 's2' 

new_df = pd.concat([s1, s2, df[['a','c']]], axis=1)

This produces

         s1        s2         a         c
0 -2.483036 -0.310379  1.152942 -1.835202
1 -1.631460 -0.203932  1.299443  0.524964
2  1.264577  0.158072 -0.324786 -0.006474
3 -0.547588 -0.068449 -0.754534 -0.002423
4  0.649246  0.081156  0.003643 -0.375290

If something else is still going wrong, try to see how your case differs from this minimal example.
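
Applied to the objects in the question, the same pattern might look like the sketch below. The data here is invented: `labels` stands in for kmeans.labels_ and `X_pca` is a random array, since the real data is not shown. The key assumption is that the rows of all three pieces are already in the same order, so the new objects can simply be given the index of df.

```python
import numpy as np
import pandas as pd

# Stand-in for the original data, deliberately with a non-default index
df = pd.DataFrame({"Survived": [0, 1, 1, 0]}, index=[10, 11, 12, 13])
labels = np.array([2, 0, 1, 2])  # placeholder for kmeans.labels_
X_pca = np.random.randn(4, 3)    # placeholder for the PCA output

# Give the new objects the same index as df so the rows line up
cluster = pd.Series(labels, index=df.index, name="ClusterId")
components = pd.DataFrame(X_pca, index=df.index)

# The columns come out in the order the pieces are listed
new_df = pd.concat([df["Survived"], cluster, components], axis=1)
print(new_df.columns.tolist())  # ['Survived', 'ClusterId', 0, 1, 2]
print(new_df.shape)             # (4, 5)
```

Changing the order of the list passed to pd.concat is what controls the column order of the result.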

Edit: an illustration of why the index matters:

In [64]: s1
Out[64]:
0   -2.483036
1   -1.631460
2    1.264577
3   -0.547588
4    0.649246
Name: s1, dtype: float64

In [65]: s2
Out[65]:
1   -0.310379
2   -0.203932
3    0.158072
4   -0.068449
5    0.263546
dtype: float64

In [66]: print(pd.concat([s1, s2], axis=1))
          0         1
0 -2.483036       NaN
1 -1.631460 -0.310379
2  1.264577 -0.203932
3 -0.547588  0.158072
4  0.649246 -0.068449
5       NaN  0.263546
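
If the index labels carry no meaning and the rows are known to be in the same order, one possible way to get purely positional alignment is to drop the old labels before concatenating. This is a sketch with small made-up Series, not the asker's data:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2], name="s1")
s2 = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3], name="s2")

# Aligned on index labels: 4 rows, NaN where a label is missing
by_label = pd.concat([s1, s2], axis=1)
print(by_label.shape)  # (4, 2)

# Aligned by position: discard the old labels first
by_position = pd.concat(
    [s1.reset_index(drop=True), s2.reset_index(drop=True)], axis=1
)
print(by_position.shape)  # (3, 2)
```

Note that ignore_index=True applies to the labels along the concatenation axis, so with axis=1 it renumbers the columns and leaves the row alignment as it was.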

Source

2014-09-05 23:03:34 Ajean

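Thanks! I'll work from here and see if I can sort this out. I really don't understand what the index has to do with it; I don't have a meaningful index, and the 'ignore_index' parameter made no difference. Maybe that's because I'm using axis=1? If I can't sort it out from here, I'll provide some actual data. – 2014-09-05 23:10:10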

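The index matters because pandas is database-savvy... 'concat' will only put things together that share the same index (i.e. if one of your Series has row index [0,1,2] and another has [2,3,4], it will give you a DataFrame with index [0,1,2,3,4], with NaNs wherever there is no overlap). – Ajean 2014-09-05 23:15:10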

@ElDuderino I added an illustration to the post. – Ajean 2014-09-05 23:19:40
