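Pandas: concatenate DataFrames and merge the values of identically named columns

I have nine different DataFrames that I want to merge (or join, or update) into a single DataFrame. Each of the original DataFrames contains only two columns: a time in seconds and the value for that observation. The data looks like this: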

Filter_type   Time 
0   0.0 6333.137168 


    Filter_type   Time 
0   0.0 6347.422576 


    Filter_type   Time 
0   0.0 7002.406185 


    Filter_type   Time 
0   0.0 7015.845717 


    Sign_pos_X   Time 
0  11.5 6333.137168 
1  25.0 6347.422576 
2  25.5 7002.406185 
3  38.0 7015.845717 


    Sign_pos_Y   Time 
0  -3.0 6333.137168 
1   8.0 6347.422576 
2  -7.5 7002.406185 
3  -0.5 7015.845717 


    Sign_pos_Z   Time 
0   1.0 6333.137168 
1   1.0 6347.422576 
2   1.0 7002.406185 
3   7.5 7015.845717 


    Supplementary_sign_type   Time 
0      0.0 6333.137168 
1      0.0 6347.422576 
2      0.0 7002.406185 
3      0.0 7015.845717 


      Time vision_only_sign_type 
0 6333.137168     7.0 
1 6347.422576     9.0 
2 7002.406185     9.0 
3 7015.845717     35.0

Since I want all of them joined into one single DataFrame, I tried the following:

df2 = None

for cell in df['Frames']:
    if not isinstance(cell, list):
        continue

    df_ = pd.DataFrame(cell)
    if df2 is None:
        # first iteration
        df2 = df_
        continue

    df2 = df2.merge(df_, on='Offset', how='outer')
    #df2 = df2.join(df_)
    #df2.update(df_, join='outer')

df2

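The problem is that the first four DataFrames share the same name for their value column, while the others do not. As a result, the output contains several columns prefixed with "Filter_type":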

+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+ 
| | Filter_type_x | Offset | Filter_type_y | Filter_type_x | Filter_type_y | Sign_pos_X | Sign_pos_Y | Sign_pos_Z | Supplementary_sign_type | vision_only_sign_type | 
|----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------| 
| 0 |    0 | 6333.14 |    nan |    nan |    nan |   11.5 |   -3 |   1 |       0 |      7 | 
| 1 |    nan | 6347.42 |    0 |    nan |    nan |   25 |   8 |   1 |       0 |      9 | 
| 2 |    nan | 7002.41 |    nan |    0 |    nan |   25.5 |   -7.5 |   1 |       0 |      9 | 
| 3 |    nan | 7015.85 |    nan |    nan |    0 |   38 |   -0.5 |   7.5 |       0 |      35 | 
+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+

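My question is: how can I force the merge/join to combine all of the "Filter_type" columns into one? As you can see, each row has a value in only one of these columns, and the others are NaN. The result should look like this (with only one merged "Filter_type" column):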

+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+ 
| | Offset | Sign_pos_X | Sign_pos_Y | Sign_pos_Z | Supplementary_sign_type | vision_only_sign_type | Filter_type | 
|----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------| 
| 0 | 6333.14 |   11.5 |   -3 |   1 |       0 |      7 |    0 | 
| 1 | 6347.42 |   25 |   8 |   1 |       0 |      9 |    0 | 
| 2 | 7002.41 |   25.5 |   -7.5 |   1 |       0 |      9 |    0 | 
| 3 | 7015.85 |   38 |   -0.5 |   7.5 |       0 |      35 |    0 | 
+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+

Source

2017-10-08 Matthias

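Calling pd.merge in a loop leads to quadratic copying and poor performance when the DataFrames are long or there are many of them. So avoid it where possible.

Here, it seems we want to concatenate the DataFrames vertically when they have Time and Filter_type columns, and concatenate them horizontally when they lack the Filter_type column: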

frames = [df.set_index('Time') for df in frames] 
filter_type_frames = pd.concat(frames[:4], axis=0) 
result = pd.concat([filter_type_frames] + frames[4:], axis=1) 
result = result.reset_index('Time') 
print(result)

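Calling pd.concat with axis=0 concatenates vertically; with axis=1 it concatenates horizontally. Because pd.concat accepts a list of DataFrames and joins them all at once, without iteratively creating intermediate DataFrames, it avoids the quadratic copying problem.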
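As a small illustration of the two axes, here is a minimal sketch. The frames a, b and c and their values are made up for the example and are not part of the original data:

```python
import pandas as pd

# Toy frames (hypothetical values, for illustration only)
a = pd.DataFrame({"x": [1, 2]}, index=[10, 20])
b = pd.DataFrame({"x": [3, 4]}, index=[30, 40])
c = pd.DataFrame({"y": [5, 6]}, index=[10, 20])

# axis=0 stacks rows: same column, more rows
vertical = pd.concat([a, b], axis=0)

# axis=1 places columns side by side: same rows, more columns
horizontal = pd.concat([a, c], axis=1)

print(vertical.shape)    # (4, 1)
print(horizontal.shape)  # (2, 2)
```

Both calls receive the whole list at once, so only one result frame is built per call.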

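Since pd.concat aligns on the index, setting the index to Time ensures that the data is aligned correctly by Time.

See below for a runnable example.

There is another way to solve the problem, and in some ways it is prettier, but it calls pd.merge in a loop, so it may suffer from poor performance for the reason explained above.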
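To see the alignment at work, here is a minimal sketch in which the second frame lists the same Time values in a different row order. The column names a and b and all values are invented for the example:

```python
import pandas as pd

# Same Time labels, but stored in a different row order (hypothetical data)
left = pd.DataFrame({"Time": [1.0, 2.0, 3.0], "a": [10, 20, 30]}).set_index("Time")
right = pd.DataFrame({"Time": [3.0, 1.0, 2.0], "b": [300, 100, 200]}).set_index("Time")

# axis=1 matches rows by index label, not by position
aligned = pd.concat([left, right], axis=1)

# Each Time label keeps its own values from both frames
print(aligned.loc[1.0, "a"], aligned.loc[1.0, "b"])  # 10 100
print(aligned.loc[3.0, "a"], aligned.loc[3.0, "b"])  # 30 300
```

Without set_index('Time'), both frames would carry the default 0, 1, 2 index and the rows would be paired by position instead.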

The idea, however, is this: by default, pd.merge(left, right) merges on all column labels that left and right have in common. So if you omit on='Offset' (or on='Time'?) and use

df2 = df2.merge(df_, how='outer')

then the merge will join on both Offset (or Time) and Filter_type whenever both are present.

You can simplify the loop further by using
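The difference between the two calls can be sketched on a cut-down version of the data. This assumes the key column is named Time, as in the sample frames; the variable names f1, f2, pos, with_on, both and merged are introduced only for this example:

```python
import pandas as pd

f1 = pd.DataFrame({"Filter_type": [0.0], "Time": [6333.137168]})
f2 = pd.DataFrame({"Filter_type": [0.0], "Time": [6347.422576]})
pos = pd.DataFrame({"Sign_pos_X": [11.5, 25.0],
                    "Time": [6333.137168, 6347.422576]})

# With on='Time', Filter_type is not a key, so it is duplicated with suffixes
with_on = f1.merge(f2, on="Time", how="outer")
print(with_on.columns.tolist())

# Without on=, every shared label (Filter_type and Time) is used as a key,
# so the two frames collapse into a single Filter_type column
both = f1.merge(f2, how="outer")

# Here only Time is shared, so the merge is on Time alone
merged = both.merge(pos, how="outer")
print(merged.columns.tolist())
```

The first print shows the suffixed Filter_type_x and Filter_type_y columns from the question; the second shows a single Filter_type column next to Time and Sign_pos_X.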

import functools 
df2 = functools.reduce(functools.partial(pd.merge, how='outer'), df['Frames'])

The loop is hidden inside functools.reduce, but in essence pd.merge is still being called in a loop. So while this is pretty, it may not perform well.

import functools 
import pandas as pd 
frames = [pd.DataFrame({'Filter_type': [0.0], 'Time': [6333.137168]}), 
      pd.DataFrame({'Filter_type': [0.0], 'Time': [6347.422576]}), 
      pd.DataFrame({'Filter_type': [0.0], 'Time': [7002.406185]}), 
      pd.DataFrame({'Filter_type': [0.0], 'Time': [7015.845717]}), 
      pd.DataFrame({'Sign_pos_X': [11.5, 25.0, 25.5, 38.0], 
         'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}), 
      pd.DataFrame({'Sign_pos_Y': [-3.0, 8.0, -7.5, -0.5], 
         'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}), 
      pd.DataFrame({'Sign_pos_Z': [1.0, 1.0, 1.0, 7.5], 
         'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}), 
      pd.DataFrame({'Supplementary_sign_type': [0.0, 0.0, 0.0, 0.0], 
         'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}), 
      pd.DataFrame({'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717], 
         'vision_only_sign_type': [7.0, 9.0, 9.0, 35.0]})] 

result = functools.reduce(functools.partial(pd.merge, how='outer'), frames) 
print(result) 

frames = [df.set_index('Time') for df in frames] 
A = pd.concat(frames[:4], axis=0) 
result = pd.concat([A] + frames[4:], axis=1) 
result = result.reset_index('Time') 
print(result) 
# same result

prints

Filter_type   Time Sign_pos_X Sign_pos_Y Sign_pos_Z \ 
0   0.0 6333.137168  11.5  -3.0   1.0 
1   0.0 6347.422576  25.0   8.0   1.0 
2   0.0 7002.406185  25.5  -7.5   1.0 
3   0.0 7015.845717  38.0  -0.5   7.5 

    Supplementary_sign_type vision_only_sign_type 
0      0.0     7.0 
1      0.0     9.0 
2      0.0     9.0 
3      0.0     35.0

Source

2017-10-08 17:15:00 unutbu

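Very nice solution. In the meantime I had also come up with the solution of concatenating the first frames. But I really like your reduce call. I will check it out as well! – Matthias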
