Create a single DataFrame from two columns that each contain lists

I have a file that looks like this:

Location Code Trait ID Effective Date 
WAU1 23984,24896,27576 06/05/2014 ,06/05/2014 ,06/12/2014 
WAU2 126973,219332 06/05/2014 ,06/05/2014 
WAU3 24375 06/05/2014 
WAU4 23984 06/05/2014 
WAU5 5199,23984 NULL 
WAU6 12342,224123 06/05/2014

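Notice how columns 2 and 3 each hold a "list" of values. In some rows the two lists have exactly the same number of elements; in others the second list is shorter or missing altogether (NULL). I need to create a single DataFrame that looks like the following: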

Location Code Trait ID Effective Date 
     0 WAU1 23984 06/05/2014 
     1 WAU1 24896 06/05/2014 
     2 WAU1 27576 06/12/2014 
     3 WAU2 126973 06/05/2014 
     4 WAU2 219332 06/05/2014 
     5 WAU3 24375 06/05/2014 
     6 WAU4 23984 06/05/2014 
     7 WAU5 5199 NaN 
     8 WAU5 23984 NaN 
     9 WAU6 12342 06/05/2014 
     10 WAU6 224123 NaN

I have been able to break each "list" column out into its own DataFrame using the following:

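# split the comma-separated 'Trait ID' strings: one row per value, original row index kept
trait_id = df1['Trait ID'].str.split(',').apply(pd.Series).stack()
trait_id.index = trait_id.index.droplevel(-1)
trait_id.name = 'Trait ID'
# swap the original string column for the split-out values
del df1['Trait ID']
df1 = df1.join(trait_id)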

Which gives me this:

Location Code Trait ID 
0   WAU1 23984 
0   WAU1 24896 
0   WAU1 27576 
1   WAU2 126973 
1   WAU2 219332 
2   WAU3 24375 
3   WAU4 23984 
4   WAU5  5199 
4   WAU5 23984 
5   WAU6 12342 
5   WAU6 224123

I can also use the same logic to create another DataFrame for the "Effective Date" list:

Location Code Effective Date 
0   WAU1 06/05/2014 
0   WAU1 06/05/2014 
0   WAU1 06/12/2014 
1   WAU2 06/05/2014 
1   WAU2 06/05/2014 
2   WAU3 06/05/2014 
3   WAU4 06/05/2014 
4   WAU5   NaN 
5   WAU6 06/05/2014

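I am struggling to find the right pandas "function" (e.g. join, merge, concat) to combine the two DataFrames into my desired output. I have a feeling it is some combination of them, with a reset_index() in there somewhere.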
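A minimal sketch of why a plain join on the repeated index does not give the desired output, and of one possible way around it: number each value's position within its location with cumcount() and merge on that. The two frames below are retyped from a subset of the split results above (WAU1, WAU5, WAU6), the "pos" column is a hypothetical helper name, and the sketch assumes each Location Code came from a single row of the original file.

```python
import pandas as pd

# The two already-split frames, with the repeated original row index.
traits = pd.DataFrame(
    {"Location Code": ["WAU1", "WAU1", "WAU1", "WAU5", "WAU5", "WAU6", "WAU6"],
     "Trait ID": ["23984", "24896", "27576", "5199", "23984", "12342", "224123"]},
    index=[0, 0, 0, 4, 4, 5, 5],
)
dates = pd.DataFrame(
    {"Location Code": ["WAU1", "WAU1", "WAU1", "WAU5", "WAU6"],
     "Effective Date": ["06/05/2014", "06/05/2014", "06/12/2014", None, "06/05/2014"]},
    index=[0, 0, 0, 4, 5],
)

# Joining on the duplicated index pairs every trait with every date of the
# same original row, so WAU1 alone contributes 3 x 3 rows.
naive = traits[["Trait ID"]].join(dates[["Effective Date"]])

# Hypothetical helper column: position of each value within its location.
traits["pos"] = traits.groupby("Location Code").cumcount()
dates["pos"] = dates.groupby("Location Code").cumcount()

# A left merge keeps every trait and fills NaN where no date has that position.
merged = (
    traits.merge(dates, on=["Location Code", "pos"], how="left")
          .drop(columns="pos")
)
print(len(naive), len(merged))
print(merged)
```

The left merge keeps the trait rows in their original order and returns a fresh 0..n-1 index, so no separate reset_index() is needed in this variant.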

Source

2016-01-29 Jeff Pipas

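What type of "file" is the data source? What is the delimiter (comma, pipe, tab)? Does it occasionally drop commas like this? May I also ask where this data originally comes from (HTML, XML, RDBMS, etc.)? – Parfait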

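It is a tab-delimited file, and columns 2 and 3 each hold a comma-separated string of values. I need to split the second column into rows and then "attach" the third column's value to each row whose "original" position in the second column has a matching element in the third (if that makes sense). Otherwise, that expanded row should get a NaN/null. –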
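As a rough illustration of the file described in this comment, the sketch below reads a tab-delimited sample with pandas. The inline text is retyped from the question and assumes the column separators are real tab characters; the actual file name is not given, so an in-memory buffer stands in for it.

```python
import io
import pandas as pd

# In-memory stand-in for the tab-delimited file (contents retyped from the question).
raw = (
    "Location Code\tTrait ID\tEffective Date\n"
    "WAU1\t23984,24896,27576\t06/05/2014,06/05/2014,06/12/2014\n"
    "WAU2\t126973,219332\t06/05/2014,06/05/2014\n"
    "WAU3\t24375\t06/05/2014\n"
    "WAU4\t23984\t06/05/2014\n"
    "WAU5\t5199,23984\tNULL\n"
    "WAU6\t12342,224123\t06/05/2014\n"
)

# dtype=str keeps single IDs such as 24375 as text, so .str.split works on
# every row; the literal NULL is among read_csv's default missing-value markers.
df1 = pd.read_csv(io.StringIO(raw), sep="\t", dtype=str)
print(df1)
```

Reading everything as text avoids the mixed column types (strings next to parsed numbers or timestamps) that would otherwise make the later string splitting skip some rows.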

Starting with:

Location Code    Trait ID     Effective Date 
0   WAU1 23984, 24896, 27576 06/05/2014,06/05/2014,06/12/2014 
1   WAU2  126973, 219332    06/05/2014,06/05/2014 
2   WAU3    24375    2014-06-05 00:00:00 
3   WAU4    23984    2014-06-05 00:00:00 
4   WAU5   5199, 23984        NaN 
5   WAU6  12342, 224123    2014-06-05 00:00:00

You can groupby('Location Code'), use str.split(',') with expand=True, pivot the result using stack(), and concat each group:

df1.groupby('Location Code').apply(lambda x: pd.concat([x['Trait ID'].str.split(',', expand=True).stack(), x['Effective Date'].str.split(',', expand=True).stack()], axis=1)).reset_index([1, 2], drop=True)

to get:

     0     1 
Location Code        
WAU1    23984   06/05/2014 
WAU1    24896   06/05/2014 
WAU1    27576   06/12/2014 
WAU2   126973   06/05/2014 
WAU2   219332   06/05/2014 
WAU3    24375 2014-06-05 00:00:00 
WAU4    23984 2014-06-05 00:00:00 
WAU5    5199     NaN 
WAU5    23984     NaN 
WAU6    12342 2014-06-05 00:00:00 
WAU6   224123     NaN
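For reference, a self-contained sketch of the same split-and-stack idea, applied to the whole frame at once instead of per group. The sample frame is retyped from the question with all values as plain strings, and split_column is a hypothetical helper name, not a pandas function. The explicit dropna() is an assumption-proofing step, since whether stack() discards missing cells by itself can depend on the pandas version.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Location Code": ["WAU1", "WAU2", "WAU3", "WAU4", "WAU5", "WAU6"],
    "Trait ID": ["23984,24896,27576", "126973,219332", "24375",
                 "23984", "5199,23984", "12342,224123"],
    "Effective Date": ["06/05/2014,06/05/2014,06/12/2014",
                       "06/05/2014,06/05/2014", "06/05/2014",
                       "06/05/2014", None, "06/05/2014"],
})

def split_column(frame, column):
    """Hypothetical helper: one entry per list element, indexed by
    (original row, position within the list)."""
    return (
        frame[column]
        .str.split(",", expand=True)
        .stack()
        .dropna()
        .str.strip()
        .rename(column)
    )

# Aligning on (row, position) pairs each trait with the date at the same
# position and leaves NaN where the date list is shorter or missing.
combined = pd.concat(
    [split_column(df1, "Trait ID"), split_column(df1, "Effective Date")],
    axis=1,
).sort_index()

# Drop the position level, bring back 'Location Code', and renumber the rows.
combined = combined.reset_index(level=1, drop=True)
result = df1[["Location Code"]].join(combined).reset_index(drop=True)
print(result)
```

Because the alignment key is the pair (original row, position), this variant does not rely on Location Code values being unique.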

Source

2016-01-29 21:18:26 Stefan

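I think this does the trick! I will apply it to a larger file to see whether I have missed anything, but I think my test case above covers all the "use cases". Thanks for your help and the quick reply! –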
