Speeding up insertion of empty rows into a large pandas DataFrame?

I have a pandas DataFrame with hundreds of millions of rows that looks like this:

Date  Attribute A Attribute B Value 
01/01/16 A    1    50 
01/05/16 A    1    60 
01/02/16 B    1    59 
01/04/16 B    1    90 
01/10/16 B    1    84

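For each unique combination (call it b) of Attribute A x Attribute B, I need to fill in the missing dates, starting from the earliest date of that unique group b and running to the maximum date in the entire DataFrame df. That is, so it looks like this: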

Date  Attribute A Attribute B Value 
01/01/16 A    1    50 
01/02/16 A    1    0 
01/03/16 A    1    0 
01/04/16 A    1    0 
01/05/16 A    1    60 
01/02/16 B    1    59 
01/03/16 B    1    0 
01/04/16 B    1    90 
01/05/16 B    1    0 
01/06/16 B    1    0 
01/07/16 B    1    0 
01/08/16 B    1    84

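I then need to calculate the coefficient of variation (standard deviation / mean) of Value for each unique combination (after the 0s have been inserted). My code looks like this: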

final = pd.DataFrame() 
max_date = df['Date'].max() 
for name, group in df.groupby(['Attribute_A','Attribute_B']): 
    idx = pd.date_range(group['Date'].min(), 
         max_date) 

    temp = group.set_index('Date').reindex(idx, fill_value=0) 
    coeff_var = temp['Value'].std()/temp['Value'].mean() 
    final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])

This runs very slowly, and I am looking for a way to speed it up.

Any suggestions?

Source

2017-05-24 user1566200

I don't know whether my approach is faster than yours, but here goes:

import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2016', '1/5/2016', '1/2/2016', '1/4/2016', '1/10/2016'],
                   'Attribute A': ['A', 'A', 'B', 'B', 'B'],
                   'Attribute B': [1, 1, 1, 1, 1],
                   'Value': [50, 60, 59, 90, 84]})
# Parse the month/day/year strings once, up front
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

groups = []
for attr in df['Attribute A'].unique():
    subset = df[df['Attribute A'] == attr].set_index('Date')
    # Daily index from this group's earliest date to its latest date
    dates = pd.date_range(subset.index.min(), subset.index.max())
    subset = subset.reindex(dates)
    # Rows added by reindex are NaN: carry the attributes forward, zero the values
    subset[['Attribute A', 'Attribute B']] = subset[['Attribute A', 'Attribute B']].ffill()
    subset['Value'] = subset['Value'].fillna(0)
    groups.append(subset)

result = pd.concat(groups)
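
The loop above stops at the zero-filled frame; it does not yet produce the coefficient of variation the question asks for. One way to get there is a single groupby aggregation. This is a minimal sketch, assuming a `result` frame shaped like the one built above. The small frame below is a hand-written stand-in for it, and the `Coeff_Var` column name is borrowed from the question.

```python
import pandas as pd

# Hand-written stand-in for the zero-filled `result` frame (group A only).
result = pd.DataFrame(
    {
        "Attribute A": ["A"] * 5,
        "Attribute B": [1] * 5,
        "Value": [50, 0, 0, 0, 60],
    },
    index=pd.date_range("2016-01-01", "2016-01-05"),
)

# One pass over the data: sample standard deviation (ddof=1) and mean per group.
stats = result.groupby(["Attribute A", "Attribute B"])["Value"].agg(["std", "mean"])

# Coefficient of variation = std / mean, returned as one row per group.
coeff_var = (stats["std"] / stats["mean"]).rename("Coeff_Var").reset_index()
print(coeff_var)
```

This replaces a per-group Python loop for the statistics only; the zero-filled rows still have to be materialized first.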

Source

2017-05-25 07:22:42

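"This runs very slowly, and I am looking for a way to speed it up. Any suggestions?"

I don't have a ready-made solution, but here is how I would suggest you approach the problem:

1. Understand what makes this slow
2. Find a way to make the critical parts faster
3. Or, alternatively, find a new approach

Here is an analysis of your code using line profiler: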

Timer unit: 1e-06 s 

Total time: 0.028074 s 
File: <ipython-input-54-ad49822d490b> 
Function: foo at line 1 

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def foo():
     2         1          875    875.0      3.1      final = pd.DataFrame()
     3         1          302    302.0      1.1      max_date = df['Date'].max()
     4         3         3343   1114.3     11.9      for name, group in df.groupby(['Attribute_A','Attribute_B']):
     5         2          836    418.0      3.0          idx = pd.date_range(group['Date'].min(),
     6         2         3601   1800.5     12.8                              max_date)
     7
     8         2         6713   3356.5     23.9          temp = group.set_index('Date').reindex(idx, fill_value=0)
     9         2         1961    980.5      7.0          coeff_var = temp['Value'].std()/temp['Value'].mean()
    10         2        10443   5221.5     37.2          final = pd.concat([final, pd.DataFrame({'Attribute_A':[name[0]], 'Attribute_B':[name[1]],'Coeff_Var':[coeff_var]})])

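In summary, the .reindex and concat statements take about 60% of the time (23.9% and 37.2% respectively).

A first approach, which saves 42% of the time in my measurements, is to collect the data for the final DataFrame as a list of rows and to create the DataFrame as the last step. Like this: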

newdata = [] 
max_date = df['Date'].max() 
for name, group in df.groupby(['Attribute_A','Attribute_B']): 
    idx = pd.date_range(group['Date'].min(), 
         max_date) 
    temp = group.set_index('Date').reindex(idx, fill_value=0) 
    coeff_var = temp['Value'].std()/temp['Value'].mean() 
    newdata.append({'Attribute_A': name[0], 'Attribute_B': name[1],'Coeff_Var':coeff_var}) 
final = pd.DataFrame.from_records(newdata)

Using timeit to measure the best execution time, I get:

Your solution: 100 loops, best of 3: 11.5 ms per loop
Improved concat: 100 loops, best of 3: 6.67 ms per loop

For details, see this ipython notebook.
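
For readers who want to repeat the comparison outside a notebook, the two variants can be timed with the standard-library `timeit` module. This is only a sketch: the function names `cv_rows`, `with_concat` and `with_list` are made up for illustration, the sample frame mirrors the question's data with underscore column names, and the numbers it prints will differ from the ones quoted above.

```python
import timeit

import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2016-01-01", "2016-01-05", "2016-01-02", "2016-01-04", "2016-01-10"]
        ),
        "Attribute_A": ["A", "A", "B", "B", "B"],
        "Attribute_B": [1, 1, 1, 1, 1],
        "Value": [50, 60, 59, 90, 84],
    }
)


def cv_rows(df):
    """Yield (Attribute_A, Attribute_B, coefficient of variation) per group."""
    max_date = df["Date"].max()
    for name, group in df.groupby(["Attribute_A", "Attribute_B"]):
        idx = pd.date_range(group["Date"].min(), max_date)
        temp = group.set_index("Date").reindex(idx, fill_value=0)
        yield name[0], name[1], temp["Value"].std() / temp["Value"].mean()


def with_concat(df):
    """Original pattern: grow the result with pd.concat inside the loop."""
    final = pd.DataFrame()
    for a, b, cv in cv_rows(df):
        row = pd.DataFrame({"Attribute_A": [a], "Attribute_B": [b], "Coeff_Var": [cv]})
        final = pd.concat([final, row])
    return final


def with_list(df):
    """Improved pattern: collect plain dicts, build the DataFrame once."""
    rows = [{"Attribute_A": a, "Attribute_B": b, "Coeff_Var": cv} for a, b, cv in cv_rows(df)]
    return pd.DataFrame.from_records(rows)


for func in (with_concat, with_list):
    # Best of 3 repeats, 100 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(df), number=100, repeat=3)) / 100
    print(f"{func.__name__}: {best * 1e3:.2f} ms per loop")
```

Both functions share the same per-group work, so any difference in the printed timings comes from how the result is assembled.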

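Note: your mileage may vary. I used the sample data provided in the original post. You should run the line profiler on a subset of your real data, since the dominating factor in time use may well be something else there.

Source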
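
As for the "find a new approach" option: because every inserted row has Value 0, the padded rows change neither the sum nor the sum of squares of a group, only its row count. That suggests the coefficient of variation could be computed without reindexing at all. The following is a sketch of that idea, not something that was timed here. It assumes `Date` is already a datetime column, at most one row per date within a group (which the reindex-based code also needs), and daily spacing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2016-01-01", "2016-01-05", "2016-01-02", "2016-01-04", "2016-01-10"]
        ),
        "Attribute_A": ["A", "A", "B", "B", "B"],
        "Attribute_B": [1, 1, 1, 1, 1],
        "Value": [50, 60, 59, 90, 84],
    }
)

keys = ["Attribute_A", "Attribute_B"]
max_date = df["Date"].max()

# Per group: first date, sum of values, sum of squared values.
agg = (
    df.assign(Sq=df["Value"].astype(float) ** 2)
    .groupby(keys)
    .agg(start=("Date", "min"), total=("Value", "sum"), total_sq=("Sq", "sum"))
)

# Row count the zero-padded group would have: first date through max_date, inclusive.
n = (max_date - agg["start"]).dt.days + 1

mean = agg["total"] / n
# Sample variance (ddof=1, like Series.std) from the sums; zeros add nothing to them.
var = ((agg["total_sq"] - n * mean**2) / (n - 1)).clip(lower=0)

agg["Coeff_Var"] = np.sqrt(var) / mean
final = agg["Coeff_Var"].reset_index()
print(final)
```

The sum-of-squares formula can lose precision when values are large relative to their spread, so it is worth checking the result against the reindex-based loop on a sample of real data before relying on it.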

2017-05-25 08:07:54 miraculixx
