2017-09-12 15 views
2

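Pandas: find the date one year earlier from a non-uniform list of dates

I could use some more help with a project. I am analyzing 4.5 million rows of data. I have read the data into a dataframe and organized it, and now have 3 columns: 1) date, as a datetime 2) unique identifier 3) price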

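I need to calculate the year-over-year change in price for each item. The dates are not uniform, and they are not consistent from item to item. For example: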

date  item price 
12/31/15 A  110 
12/31/15 B  120 
12/31/14 A  100 
6/24/13 B  100 

The result I would like to get is:

date  item price previousdate % change 
12/31/15 A  110 12/31/14  10% 
12/31/15 B  120 6/24/13  20% 
12/31/14 A  100 
6/24/13 B  100 

Edit - data

date item price 
6/1/2016 A 276.3457646 
6/1/2016 B 5.044165645 
4/27/2016 B 4.91300186 
4/27/2016 A 276.4329163 
4/20/2016 A 276.9991265 
4/20/2016 B 4.801263717 
4/13/2016 A 276.1950213 
4/13/2016 B 5.582923328 
4/6/2016 B 5.017863509 
4/6/2016 A 276.218649 
3/30/2016 B 4.64274783 
3/30/2016 A 276.554653 
3/23/2016 B 5.576438253 
3/23/2016 A 276.3135836 
3/16/2016 B 5.394435443 
3/16/2016 A 276.4222986 
3/9/2016 A 276.8929462 
3/9/2016 B 4.999951262 
3/2/2016 B 4.731349423 
3/2/2016 A 276.3972068 
1/27/2016 A 276.8458971 
1/27/2016 B 4.993033132 
1/20/2016 B 5.250379701 
1/20/2016 A 276.2899864 
1/13/2016 B 5.146639666 
1/13/2016 A 276.7041978 
1/6/2016 B 5.328296958 
1/6/2016 A 276.9465891 
12/30/2015 B 5.312301356 
12/30/2015 A 256.259668 
12/23/2015 B 5.279105491 
12/23/2015 A 255.8411198 
12/16/2015 B 5.150798234 
12/16/2015 A 255.8360529 
12/9/2015 A 255.4915183 
12/9/2015 B 4.722876886 
12/2/2015 A 256.267146 
12/2/2015 B 5.083626167 
10/28/2015 B 4.876177757 
10/28/2015 A 255.6464653 
10/21/2015 B 4.551439655 
10/21/2015 A 256.1735769 
10/14/2015 A 255.9752668 
10/14/2015 B 4.693967392 
10/7/2015 B 4.911797443 
10/7/2015 A 256.2556707 
9/30/2015 B 4.262994526 
9/30/2015 A 255.8068691 
7/1/2015 A 255.7312385 
4/22/2015 A 234.6210132 
4/15/2015 A 235.3902076 
4/15/2015 B 4.154926102 
4/1/2015 A 234.4713827 
2/25/2015 A 235.1391496 
2/18/2015 A 235.1223471 

My best attempt so far (with some help from other users) is below, but it does not work. Thanks for any help or for pointing me in the right direction!

import pandas as pd 
import datetime as dt 
import numpy as np 

df = pd.read_csv('...python test file5.csv',parse_dates =['As of Date']) 

df = df[['item','price','As of Date']] 

def get_prev_year_price(x, df): 
    try: 
        return df.loc[x['prev_year_date'], 'price'] 
        #return np.abs(df.time - x) 
    except Exception as e: 
        return x['price'] 

#Function to determine the closest date from given date and list of all dates 
def nearest(items, pivot): 
    return min(items, key=lambda x: abs(x - pivot)) 

df['As of Date'] = pd.to_datetime(df['As of Date'],format='%m/%d/%Y') 
df = df.rename(columns = {df.columns[2]:'date'}) 

# list of dates 
dtlst = [item for item in df['date']] 

data = [] 
data2 = [] 
for item in df['item'].unique(): 
    item_df = df[df['item'] == item] #select based on items 
    select_dates = item_df['date'].unique() 
    item_df.set_index('date', inplace=True) #set date as key index 

    item_df = item_df.resample('D').mean().reset_index() #fill in missing date 
    item_df['price'] = item_df['price'].interpolate('nearest') #fill in price with nearest price available 
    # use max(item_df['date'] where item_df['date'] < item_df['date'] - pd.DateOffset(years=1, days=1)) 
     #possible_date = item_df['date'] - pd.DateOffset(years=1) 
     #item_df['prev_year_date'] = max(df[df['date'] <= possible_date]) 

    item_df['prev_year_date'] = item_df['date'] - pd.DateOffset(years=1) #calculate 1 year ago date 
    date_df = item_df[item_df.date.isin(select_dates)] #select dates with useful data 
    item_df.set_index('date', inplace=True) 

    date_df['prev_year_price'] = date_df.apply(lambda x: get_prev_year_price(x, item_df),axis=1) 
    #date_df['prev_year_price'] = date_df.apply(lambda x: nearest(dtlst, x),axis=1) 

    date_df['change'] = date_df['price']/date_df['prev_year_price']-1 
    date_df['item'] = item 
    data.append(date_df) 
    data2.append(item_df) 
summary = pd.concat(data).sort_values('date', ascending=False) 
#print (summary) 

#saving the output of the CSV file to see how data looks after being handled 
filename = '...python_test_file_save4.csv' 
summary.to_csv(filename, index=True, encoding='utf-8') 
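
Is there at most one price per item per year?

Unfortunately, there can be up to about 50 prices per item per year – CodinglyClueless

You need to define precisely what year-over-year means.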

Answers

1

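This is a good use case for merge_asof, which merges two dataframes by matching each row of the left dataframe with the last row of the right dataframe whose key is less than or equal to the left key. We first need to add one year to the dates in the right dataframe, because the requirement is that the two dates be at least one year apart.

Here is some sample data, based on what you posted in the comments.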

date  item price 
12/31/15 A  110 
12/31/15 B  120 
12/31/14 A  100 
6/24/13 B  100 
12/31/15 C  100 
1/31/15 C  80 
11/14/14 C  130 
11/19/13 C  110 
11/14/13 C  200 
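
To follow along, the sample table can be loaded into a dataframe roughly like this. This is only a sketch: the variable name raw and the two-digit-year format string are assumptions about how the table is stored, not part of the original data file.

```python
import io

import pandas as pd

# The sample table from above, pasted as whitespace-separated text (assumed layout).
raw = """date item price
12/31/15 A 110
12/31/15 B 120
12/31/14 A 100
6/24/13 B 100
12/31/15 C 100
1/31/15 C 80
11/14/14 C 130
11/19/13 C 110
11/14/13 C 200
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# merge_asof needs a real datetime column, not strings.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")
```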

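The dates need to be sorted for merge_asof to work. merge_asof also drops the join column of the right dataframe, so we need to keep a copy of it in the right dataframe.

Set up the dataframes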

df = df.sort_values('date') 
df_copy = df.copy() 
df_copy['previousdate'] = df_copy['date'] 
df_copy['date'] += pd.DateOffset(years=1) 

Use merge_asof

df_final = pd.merge_asof(df, df_copy, 
         on='date', 
         by='item', 
         suffixes=['current', 'previous']) 
df_final['% change'] = (df_final['pricecurrent'] - df_final['priceprevious'])/df_final['priceprevious'] 
df_final 

     date item pricecurrent priceprevious previousdate % change 
0 2013-06-24 B   100   NaN   NaT  NaN 
1 2013-11-14 C   200   NaN   NaT  NaN 
2 2013-11-19 C   110   NaN   NaT  NaN 
3 2014-11-14 C   130   200.0 2013-11-14 -0.350000 
4 2014-12-31 A   100   NaN   NaT  NaN 
5 2015-01-31 C   80   110.0 2013-11-19 -0.272727 
6 2015-12-31 A   110   100.0 2014-12-31 0.100000 
7 2015-12-31 B   120   100.0 2013-06-24 0.200000 
8 2015-12-31 C   100   130.0 2014-11-14 -0.230769 
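
If the match for B looks too old (6/24/13 is about two and a half years before 12/31/15), one possible refinement is the tolerance argument of merge_asof, which rejects matches that are too far back. The sketch below wraps the steps in a function; the names yearly_change and max_extra and the 180-day cap are made up for illustration, and whether such a cap is appropriate depends on how "a year ago" is defined for the real data.

```python
import pandas as pd


def yearly_change(df, max_extra=None):
    """Match each row with the same item's latest row that is at least one year older.

    max_extra is a hypothetical knob: a pd.Timedelta limiting how much older
    than one year the matched row may be (None means no limit).
    """
    left = df.sort_values("date")
    right = left.copy()
    right["previousdate"] = right["date"]          # keep the original date
    right["date"] = right["date"] + pd.DateOffset(years=1)
    right = right.sort_values("date")              # merge_asof needs sorted keys
    out = pd.merge_asof(left, right, on="date", by="item",
                        suffixes=("_current", "_previous"),
                        tolerance=max_extra)
    out["change"] = out["price_current"] / out["price_previous"] - 1
    return out


sample = pd.DataFrame({
    "date": pd.to_datetime(["2015-12-31", "2015-12-31", "2014-12-31", "2013-06-24"]),
    "item": ["A", "B", "A", "B"],
    "price": [110, 120, 100, 100],
})

result = yearly_change(sample)                                   # no cap
capped = yearly_change(sample, max_extra=pd.Timedelta(days=180))  # assumed cap
```

With the cap, item A still matches 12/31/14, while item B is left without a previous price because its only earlier observation is more than 180 days beyond the one-year mark.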

Wow, this is very helpful - thank you for your help! – CodinglyClueless

2

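Under the assumptions of the current example this works, but only for this specific case: diff() returns the absolute price difference, which equals the percentage change here only because both earlier prices are 100.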

In [2459]: def change(grp): 
     ...:  grp['% change'] = grp.price.diff() 
     ...:  grp['previousdate'] = grp.date.shift(1) 
     ...:  return grp 

Sort on date, then groupby item and apply the change function, then sort the index to restore the original order.

In [2460]: df.sort_values('date').groupby('item').apply(change).sort_index() 
Out[2460]: 
     date item price % change previousdate 
0 2015-12-31 A 110  10.0 2014-12-31 
1 2015-12-31 B 120  20.0 2013-06-24 
2 2014-12-31 A 100  NaN   NaT 
3 2013-06-24 B 100  NaN   NaT 
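
The same idea can be written without apply, which tends to matter when there are thousands of items. The following is a sketch on the four-row example, not a drop-in solution: it swaps diff() for pct_change() so the result is a true percentage, and it still compares each row with the previous observation of the same item, however close in time that is, rather than with one at least a year older.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2015-12-31", "2015-12-31", "2014-12-31", "2013-06-24"]),
    "item": ["A", "B", "A", "B"],
    "price": [110, 120, 100, 100],
})

# Sort chronologically so "previous" means "earlier in time" within each item.
df = df.sort_values("date")

# Previous observation date and percent change, computed per item.
df["previousdate"] = df.groupby("item")["date"].shift(1)
df["% change"] = df.groupby("item")["price"].pct_change() * 100

# Restore the original row order.
df = df.sort_index()
```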
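
Yes, this works for this case, but unfortunately I don't think it will work for the actual data, since there are thousands of items... I think I understand what you are doing with the date shift, though, and that may be the key to figuring this out? – CodinglyClueless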
