Fill in all months for a multi-index DataFrame in pandas

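I have a table containing monthly sales and forecasts for thousands of products from 2015 to 2017. My data gives the demand and the forecast per site, type, product and date.

The problem is that when there are no sales and no forecast in a given month, there is no row for that month. In the example below, you can see that the row for "2015-08-31" is missing. I would like that row to be present with a demand of 0 and a forecast of 0 (see the df_expected example below).

Basically, I want to fill this table with 0 for every date between 2015-06-30 and 2017-09-30, for every Product/Type/Site combination.

As you can see in the code, I have not defined any index, but ["Site", "Type", "Product", "Date"] can essentially be treated as a MultiIndex.

Note that I have several million rows.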

import pandas as pd 
data = [("W1","G1",1234,pd.to_datetime("2015-07-31"),8,4), 
     ("W1","G1",1234,pd.to_datetime("2015-09-30"),2,4), 
     ("W1","G1",1234,pd.to_datetime("2015-10-31"),2,4), 
     ("W1","G1",1234,pd.to_datetime("2015-11-30"),4,4), 
     ("W1","G2",2345,pd.to_datetime("2015-07-31"),5,0), 
     ("W1","G2",2345,pd.to_datetime("2015-08-31"),1,3), 
     ("W1","G2",2345,pd.to_datetime("2015-10-31"),1,3), 
     ("W1","G2",2345,pd.to_datetime("2015-11-30"),3,3)] 
labels = ["Site","Type","Product","Date","Demand","Forecast"] 
df = pd.DataFrame(data,columns=labels) 
df 

    Site Type Product  Date Demand Forecast 
0 W1 G1  1234 2015-07-31  8   4 
1 W1 G1  1234 2015-09-30  2   4 
2 W1 G1  1234 2015-10-31  2   4 
3 W1 G1  1234 2015-11-30  4   4 
4 W1 G2  2345 2015-07-31  5   0 
5 W1 G2  2345 2015-08-31  1   3 
6 W1 G2  2345 2015-10-31  1   3 
7 W1 G2  2345 2015-11-30  3   3

This is what I expect:

data_expected = [("W1","G1",1234,pd.to_datetime("2015-07-31"),8,4), 
       ("W1","G1",1234,pd.to_datetime("2015-08-31"),0,0), 
       ("W1","G1",1234,pd.to_datetime("2015-09-30"),2,4),   
       ("W1","G1",1234,pd.to_datetime("2015-10-31"),2,4), 
       ("W1","G1",1234,pd.to_datetime("2015-11-30"),4,4)] 
df_expected = pd.DataFrame(data_expected,columns=labels) 
df_expected 

    Site Type Product  Date Demand Forecast 
0 W1 G1  1234 2015-07-31  8   4 
1 W1 G1  1234 2015-08-31  0   0 
2 W1 G1  1234 2015-09-30  2   4 
3 W1 G1  1234 2015-10-31  2   4 
4 W1 G1  1234 2015-11-30  4   4

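I was thinking of stacking/unstacking the result to make sure I have all the months, but for a DataFrame with millions of rows that is not optimal.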

df = (df
      .set_index("Date")
      # pd.Grouper(freq='M') replaces pd.TimeGrouper('M'), which was removed in pandas 1.0
      .groupby(["Site","Product","Type",pd.Grouper(freq='M')])[["Forecast","Demand"]].sum()
      .unstack()
      .fillna(0)
      .astype(int))

What do you think?

2017-09-26 Nicolas

You can use DataFrameGroupBy.resample with asfreq:

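df = (df.set_index('Date')
      # select several columns with a list; tuple-style selection fails on current pandas
      .groupby(["Site","Type","Product"])[['Demand','Forecast']]
      # on pandas >= 2.2, use 'ME' (month end) instead of 'M'
      .resample('M')
      .asfreq()
      .fillna(0)
      .astype(int)
      .reset_index())
print(df)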
    Site Type Product  Date Demand Forecast 
0 W1 G1  1234 2015-07-31  8   4 
1 W1 G1  1234 2015-08-31  0   0 
2 W1 G1  1234 2015-09-30  2   4 
3 W1 G1  1234 2015-10-31  2   4 
4 W1 G1  1234 2015-11-30  4   4
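
Note that `resample(...).asfreq()` only fills gaps between each group's own first and last date. If every combination has to cover a fixed range such as 2015-06-30 to 2017-09-30, one possible approach is to build a grid of the existing combinations and all month ends, then left-merge the data onto it. The following is a minimal sketch, not part of the original answer: the names `keys`, `months`, `combos`, `grid` and `filled` are illustrative, the sample data is shortened, and it assumes the dates in `df` are already month-end dates.

```python
import pandas as pd

labels = ["Site", "Type", "Product", "Date", "Demand", "Forecast"]
data = [("W1", "G1", 1234, pd.Timestamp("2015-07-31"), 8, 4),
        ("W1", "G1", 1234, pd.Timestamp("2015-09-30"), 2, 4),
        ("W1", "G2", 2345, pd.Timestamp("2015-08-31"), 1, 3)]
df = pd.DataFrame(data, columns=labels)

keys = ["Site", "Type", "Product"]
# Every month end in the range given in the question
# (an offset object is used instead of a string alias such as 'M').
months = pd.DataFrame({"Date": pd.date_range("2015-06-30", "2017-09-30",
                                             freq=pd.offsets.MonthEnd())})

# Only the combinations that actually occur, not the full Cartesian product of the keys.
combos = df[keys].drop_duplicates()

# Cross join through a temporary constant key (works on old and new pandas versions).
grid = (combos.assign(_tmp=1)
              .merge(months.assign(_tmp=1), on="_tmp")
              .drop(columns="_tmp"))

# Attach the observed values; months without data become 0.
filled = (grid.merge(df, on=keys + ["Date"], how="left")
              .fillna({"Demand": 0, "Forecast": 0})
              .astype({"Demand": int, "Forecast": int}))
print(filled.head())
```

The grid holds one row per existing combination per month, so its size is the number of combinations times the number of months. That is worth estimating before running it on several million rows.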

Edit:

I tried to improve the original solution a bit with the fill_value parameter of unstack:

(df.set_index("Date")
    # pd.Grouper(freq='M') replaces pd.TimeGrouper('M'), which was removed in pandas 1.0
    .groupby(["Site","Product","Type",pd.Grouper(freq='M')])[['Demand','Forecast']].sum()
    .unstack(fill_value=0)
    .stack())
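
One difference between the two variants is that `unstack()` marks missing cells with NaN, which turns integer values into floats and requires the later `fillna(0).astype(int)`, while `unstack(fill_value=0)` writes the zeros directly. The sketch below illustrates only that dtype difference on made-up data (the Series `s` is hypothetical). It does not measure speed, so whether this matters on real data would need to be timed.

```python
import pandas as pd

# Hypothetical toy data: G2 has no value for 2015-09-30.
s = pd.Series(
    [8, 2, 5],
    index=pd.MultiIndex.from_tuples(
        [("G1", "2015-07-31"), ("G1", "2015-09-30"), ("G2", "2015-07-31")],
        names=["Type", "Date"]),
    name="Demand")

with_nan = s.unstack("Date")                 # the missing cell becomes NaN
with_zero = s.unstack("Date", fill_value=0)  # the missing cell is filled with 0 directly

print(with_nan.dtypes)
print(with_zero.dtypes)
```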

2017-09-26 12:57:17 jezrael

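It seems my stack/unstack solution is faster. Your technique works for a df with 10,000 rows, but on 1 million rows it takes a looooong time (I never saw it actually finish). – Nicolas

Now I understand. I can only improve your solution: `(df.set_index("Date").groupby(["Site","Product","Type",pd.TimeGrouper('M')])['Demand','Forecast'].sum().unstack(fill_value=0).stack())`. Is it faster on your real data? If so, I can add it to my answer. – jezrael

Yes, this stack/unstack version is very fast. – Nicolas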

The stack/unstack approach seems to work faster. With it, all items get the same start date and end date.

df = (df.set_index("Date")
      # pd.Grouper(freq='M') replaces pd.TimeGrouper('M'), which was removed in pandas 1.0
      .groupby(["Site","Product","Type",pd.Grouper(freq='M')])[['Demand','Forecast']].sum()
      .unstack()
      .fillna(0)
      .astype(int)
      .stack())


           Demand Forecast 
Site Product Type Date       
W1 1234 G1 2015-07-31  8   4 
        2015-08-31  0   0 
        2015-09-30  2   4 
        2015-10-31  2   4 
        2015-11-30  4   4 
    2345 G2 2015-07-31  5   0 
        2015-08-31  1   3 
        2015-09-30  0   0 
        2015-10-31  1   3 
        2015-11-30  3   3

2017-09-26 14:54:46 Nicolas

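Could you post the output of your solution on the test data? For me it returns the same DataFrame as the original. Maybe it is a pandas version issue; I am using pandas 0.20.3. – jezrael

So I understand your issue. The problem is that if you only fill in the 4 rows of data, this particular solution will not work, because it does not pick up any additional months. But if you run it on 1M rows, it gets all the right months. I will change the initial data to reflect this. – Nicolas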
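
The caveat in the last comment comes from the fact that unstack/stack can only produce months that occur somewhere in the data. One way around it, sketched here under assumptions and not taken from the thread, is to reindex the unstacked columns to an explicit month range before stacking again. The names `value_cols`, `months`, `wide`, `full_columns` and `filled` are illustrative, and the sketch assumes the dates are already month-end dates. In the sample, 2015-08-31 is missing for every product.

```python
import pandas as pd

labels = ["Site", "Type", "Product", "Date", "Demand", "Forecast"]
data = [("W1", "G1", 1234, pd.Timestamp("2015-07-31"), 8, 4),
        ("W1", "G1", 1234, pd.Timestamp("2015-09-30"), 2, 4),
        ("W1", "G2", 2345, pd.Timestamp("2015-07-31"), 5, 0),
        ("W1", "G2", 2345, pd.Timestamp("2015-10-31"), 1, 3)]
df = pd.DataFrame(data, columns=labels)

value_cols = ["Demand", "Forecast"]
# The month ends that must appear for every combination.
months = pd.date_range("2015-07-31", "2015-10-31", freq=pd.offsets.MonthEnd())

# One row per Site/Product/Type, one column per (value column, observed month).
wide = (df.groupby(["Site", "Product", "Type", "Date"])[value_cols].sum()
          .unstack("Date", fill_value=0))

# Force one column per (value column, month), including months absent from the data.
full_columns = pd.MultiIndex.from_product([value_cols, months], names=[None, "Date"])
filled = wide.reindex(columns=full_columns, fill_value=0).stack("Date")
print(filled)
```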
