2017-09-13 57 views

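This question follows on from this SO Question. How can I track the previous date's record in a column of a pandas DataFrame?

I want to perform some data analysis on a pandas DataFrame. I have a DataFrame like this: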

    derived_symbol sport_name person_name  city \ 
0  football.RAM.mumbai.ram_count football   RAM mumbai 
1  football.RAM.mumbai.mum_count football   RAM mumbai 
2  football.RAM.delhi.mum_count football   RAM  delhi 
3  football.RAM.delhi.ram_count football   RAM  delhi 
4  football.RAM.mumbai.ram_count football   RAM mumbai 
5  football.RAM.mumbai.mum_count football   RAM mumbai 
6  football.RAM.delhi.mum_count football   RAM  delhi 
7  football.RAM.delhi.ram_count football   RAM  delhi 
8  basketball.MAH.pune.mah_count basketball   MAH  pune 
9  basketball.MAH.nagpur.mah_count basketball   MAH nagpur 
10  basketball.MAH.TOTAL.mah_count basketball   MAH No Entry 
11 basketball.MAH.TOTAL.nagpur_count basketball   MAH nagpur 
12 basketball.MAH.TOTAL.pune_count basketball   MAH  pune 
13  football.RAM.TOTAL.delhi_count football   RAM  delhi 
14  football.RAM.TOTAL.delhi_count football   RAM  delhi 
15  football.RAM.TOTAL.mum_count football   RAM No Entry 
16  football.RAM.TOTAL.mum_count football   RAM No Entry 
17 football.RAM.TOTAL.mumbai_count football   RAM mumbai 
18 football.RAM.TOTAL.mumbai_count football   RAM mumbai 
19  football.RAM.TOTAL.ram_count football   RAM No Entry 
20  football.RAM.TOTAL.ram_count football   RAM No Entry 

    person_symbol  month sir person_count 
0   ram 2017-01-23 a   10 
1   mum 2017-01-23 a   14 
2   mum 2017-01-23 a   25 
3   ram 2017-01-23 a   20 
4   ram 2017-02-22 b   34 
5   mum 2017-02-22 b   23 
6   mum 2017-02-22 b   43 
7   ram 2017-02-22 b   34 
8   mah 2017-03-03 c   10 
9   mah 2017-03-03 c   20 
10   mah 2017-03-03 c   30 
11  No Entry 2017-03-03 c   20 
12  No Entry 2017-03-03 c   10 
13  No Entry 2017-01-23 a   45 
14  No Entry 2017-02-22 b   77 
15   mum 2017-01-23 a   39 
16   mum 2017-02-22 b   66 
17  No Entry 2017-01-23 a   24 
18  No Entry 2017-02-22 b   57 
19   ram 2017-01-23 a   30 
20   ram 2017-02-22 b   68 

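I want to add a previous_person_count column to this DataFrame. The "month" column contains dates in "yyyy-mm-dd" format, so we need to look at the "mm" field to determine which month a row belongs to.

Based on that month, the "person_count" value of a row should become the "previous_person_count" value of the matching row in the following month.

Expected output: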

   derived_symbol sport_name person_name  city \ 
0  football.RAM.mumbai.ram_count football   RAM mumbai 
1  football.RAM.mumbai.mum_count football   RAM mumbai 
2  football.RAM.delhi.mum_count football   RAM  delhi 
3  football.RAM.delhi.ram_count football   RAM  delhi 
4  football.RAM.mumbai.ram_count football   RAM mumbai 
5  football.RAM.mumbai.mum_count football   RAM mumbai 
6  football.RAM.delhi.mum_count football   RAM  delhi 
7  football.RAM.delhi.ram_count football   RAM  delhi 
8  basketball.MAH.pune.mah_count basketball   MAH  pune 
9  basketball.MAH.nagpur.mah_count basketball   MAH nagpur 
10  basketball.MAH.TOTAL.mah_count basketball   MAH No Entry 
11 basketball.MAH.TOTAL.nagpur_count basketball   MAH nagpur 
12 basketball.MAH.TOTAL.pune_count basketball   MAH  pune 
13  football.RAM.TOTAL.delhi_count football   RAM  delhi 
14  football.RAM.TOTAL.delhi_count football   RAM  delhi 
15  football.RAM.TOTAL.mum_count football   RAM No Entry 
16  football.RAM.TOTAL.mum_count football   RAM No Entry 
17 football.RAM.TOTAL.mumbai_count football   RAM mumbai 
18 football.RAM.TOTAL.mumbai_count football   RAM mumbai 
19  football.RAM.TOTAL.ram_count football   RAM No Entry 
20  football.RAM.TOTAL.ram_count football   RAM No Entry 

    person_symbol  month sir person_count  previous_person_count 
0   ram 2017-01-23 a   10  0 
1   mum 2017-01-23 a   14  0 
2   mum 2017-01-23 a   25  0 
3   ram 2017-01-23 a   20  0 
4   ram 2017-02-22 b   34  10 
5   mum 2017-02-22 b   23  14 
6   mum 2017-02-22 b   43  25 
7   ram 2017-02-22 b   34  20 
8   mah 2017-03-03 c   10  0 
9   mah 2017-03-03 c   20  0 
10   mah 2017-03-03 c   30  0 
11  No Entry 2017-03-03 c   20  0 
12  No Entry 2017-03-03 c   10  0 
13  No Entry 2017-01-23 a   45  0 
14  No Entry 2017-02-22 b   77  45 
15   mum 2017-01-23 a   39  0 
16   mum 2017-02-22 b   66  39 
17  No Entry 2017-01-23 a   24  0 
18  No Entry 2017-02-22 b   57  24 
19   ram 2017-01-23 a   30  0 
20   ram 2017-02-22 b   68  30 
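
For illustration only, the expected column above can be reproduced on a few of these rows with a groupby/shift, which is a different technique from the merge used in the answer below. This is a minimal sketch that assumes derived_symbol identifies a series and that "previous" simply means the chronologically earlier record for that symbol:

```python
import pandas as pd

# A few rows copied from the table above.
df = pd.DataFrame({
    'derived_symbol': [
        'football.RAM.mumbai.ram_count',
        'football.RAM.TOTAL.delhi_count',
        'football.RAM.mumbai.ram_count',
        'football.RAM.TOTAL.delhi_count',
        'basketball.MAH.pune.mah_count',
    ],
    'month': ['2017-01-23', '2017-01-23', '2017-02-22',
              '2017-02-22', '2017-03-03'],
    'person_count': [10, 45, 34, 77, 10],
})
df['month'] = pd.to_datetime(df['month'])

# Sort by date, then shift person_count down by one row within each
# derived_symbol group; the first record of a group has no predecessor.
df = df.sort_values('month')
df['previous_person_count'] = (
    df.groupby('derived_symbol')['person_count']
      .shift(1)
      .fillna(0)
      .astype(int)
)
df = df.sort_index()  # restore the original row order
print(df)
```

Note that shift takes the previous record, not strictly the previous calendar month: if a month were missing for a symbol, the value would come from an older month.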

Edit: code for reference:

import pandas as pd

df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
      'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
       'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
     'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
     'person_count': ['10','14','25','20','34','23','43','34','10','20'],
     'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-22','2017-02-22','2017-02-22','2017-02-22','2017-03-03','2017-03-03'],
     'sir': ['a','a','a','a','b','b','b','b','c','c']})

# Fix the column order.
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]

# NOTE: symbology is my own helper and is not shown in this post;
# judging by the tables above it maps e.g. 'ramesh' to 'RAM'.
df['person_name'] = df['person_name'].apply(symbology)

df['person_count'] = df['person_count'].astype(int)

print(df)
# Stack city and person_symbol into one 'val' column; their names end up in 'level_5'.
df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')

# Build the TOTAL symbols.
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'

# Sum the counts per TOTAL symbol and month.
df2 = df1.groupby(['derived_symbol','month','sir','sport_name','person_name','level_5','val'])['person_count'].sum().reset_index(name='person_count')

# Pivot back to city / person_symbol columns; the missing one becomes 'No Entry'.
df3 = df2.set_index(['derived_symbol','month','sir','sport_name','person_name','person_count','level_5'])['val'].unstack().fillna('No Entry').rename_axis(None, axis=1).reset_index()

# Per-city symbols for the original rows, then append the TOTAL rows.
df['derived_symbol'] = df['sport_name'] + '.' + df['person_name'] + '.' + df['city'] + "."+ df['person_symbol'] + '_count'
df4 = pd.concat([df, df3]).reset_index(None)
print(df3)
del df4['index']
df4 = df4[['derived_symbol','sport_name','person_name','city','person_symbol','month','sir','person_count']]
print(df4)

For convenience, here is the DataFrame as a dict:

d = {'city': {0: 'mumbai', 
    1: 'mumbai', 
    2: 'delhi', 
    3: 'delhi', 
    4: 'mumbai', 
    5: 'mumbai', 
    6: 'delhi', 
    7: 'delhi', 
    8: 'pune', 
    9: 'nagpur', 
    10: 'No Entry', 
    11: 'nagpur', 
    12: 'pune', 
    13: 'delhi', 
    14: 'delhi', 
    15: 'No Entry', 
    16: 'No Entry', 
    17: 'mumbai', 
    18: 'mumbai', 
    19: 'No Entry', 
    20: 'No Entry'}, 
'derived_symbol': {0: 'football.RAM.mumbai.ram_count', 
    1: 'football.RAM.mumbai.mum_count', 
    2: 'football.RAM.delhi.mum_count', 
    3: 'football.RAM.delhi.ram_count', 
    4: 'football.RAM.mumbai.ram_count', 
    5: 'football.RAM.mumbai.mum_count', 
    6: 'football.RAM.delhi.mum_count', 
    7: 'football.RAM.delhi.ram_count', 
    8: 'basketball.MAH.pune.mah_count', 
    9: 'basketball.MAH.nagpur.mah_count', 
    10: 'basketball.MAH.TOTAL.mah_count', 
    11: 'basketball.MAH.TOTAL.nagpur_count', 
    12: 'basketball.MAH.TOTAL.pune_count', 
    13: 'football.RAM.TOTAL.delhi_count', 
    14: 'football.RAM.TOTAL.delhi_count', 
    15: 'football.RAM.TOTAL.mum_count', 
    16: 'football.RAM.TOTAL.mum_count', 
    17: 'football.RAM.TOTAL.mumbai_count', 
    18: 'football.RAM.TOTAL.mumbai_count', 
    19: 'football.RAM.TOTAL.ram_count', 
    20: 'football.RAM.TOTAL.ram_count'}, 
'month': {0: '2017-01-23', 
    1: '2017-01-23', 
    2: '2017-01-23', 
    3: '2017-01-23', 
    4: '2017-02-22', 
    5: '2017-02-22', 
    6: '2017-02-22', 
    7: '2017-02-22', 
    8: '2017-03-03', 
    9: '2017-03-03', 
    10: '2017-03-03', 
    11: '2017-03-03', 
    12: '2017-03-03', 
    13: '2017-01-23', 
    14: '2017-02-22', 
    15: '2017-01-23', 
    16: '2017-02-22', 
    17: '2017-01-23', 
    18: '2017-02-22', 
    19: '2017-01-23', 
    20: '2017-02-22'}, 
'person_count': {0: 10, 
    1: 14, 
    2: 25, 
    3: 20, 
    4: 34, 
    5: 23, 
    6: 43, 
    7: 34, 
    8: 10, 
    9: 20, 
    10: 30, 
    11: 20, 
    12: 10, 
    13: 45, 
    14: 77, 
    15: 39, 
    16: 66, 
    17: 24, 
    18: 57, 
    19: 30, 
    20: 68}, 
'person_name': {0: 'RAM', 
    1: 'RAM', 
    2: 'RAM', 
    3: 'RAM', 
    4: 'RAM', 
    5: 'RAM', 
    6: 'RAM', 
    7: 'RAM', 
    8: 'MAH', 
    9: 'MAH', 
    10: 'MAH', 
    11: 'MAH', 
    12: 'MAH', 
    13: 'RAM', 
    14: 'RAM', 
    15: 'RAM', 
    16: 'RAM', 
    17: 'RAM', 
    18: 'RAM', 
    19: 'RAM', 
    20: 'RAM'}, 
'person_symbol': {0: 'ram', 
    1: 'mum', 
    2: 'mum', 
    3: 'ram', 
    4: 'ram', 
    5: 'mum', 
    6: 'mum', 
    7: 'ram', 
    8: 'mah', 
    9: 'mah', 
    10: 'mah', 
    11: 'No Entry', 
    12: 'No Entry', 
    13: 'No Entry', 
    14: 'No Entry', 
    15: 'mum', 
    16: 'mum', 
    17: 'No Entry', 
    18: 'No Entry', 
    19: 'ram', 
    20: 'ram'}, 
'sir': {0: 'a', 
    1: 'a', 
    2: 'a', 
    3: 'a', 
    4: 'b', 
    5: 'b', 
    6: 'b', 
    7: 'b', 
    8: 'c', 
    9: 'c', 
    10: 'c', 
    11: 'c', 
    12: 'c', 
    13: 'a', 
    14: 'b', 
    15: 'a', 
    16: 'b', 
    17: 'a', 
    18: 'b', 
    19: 'a', 
    20: 'b'}, 
'sport_name': {0: 'football', 
    1: 'football', 
    2: 'football', 
    3: 'football', 
    4: 'football', 
    5: 'football', 
    6: 'football', 
    7: 'football', 
    8: 'basketball', 
    9: 'basketball', 
    10: 'basketball', 
    11: 'basketball', 
    12: 'basketball', 
    13: 'football', 
    14: 'football', 
    15: 'football', 
    16: 'football', 
    17: 'football', 
    18: 'football', 
    19: 'football', 
    20: 'football'}} 

Answer

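What you can do is merge the DataFrame with itself, after computing the month number (from the date) and the previous month number.

Let's start by computing these two values. For convenience, I first convert the original month string values to datetime, which lets me use relativedelta to compute the previous month. This keeps the behaviour correct even across a year boundary (the month before 2017-01 is 2016-12).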

In [7]: df['month'] = pd.to_datetime(df['month']) 

In [8]: df['month_num'] = df['month'].apply(lambda x: x.strftime('%Y-%m')) 

In [9]: from dateutil.relativedelta import relativedelta 

In [10]: df['previous_month_num'] = df['month'].apply(lambda x: (x + relativedelta(months=-1)).strftime('%Y-%m')) 

In [11]: df 
Out[11]: 
    city  month person_count person_name person_symbol sir sport_name \ 
0 mumbai 2017-01-23   10  ramesh   ram a football 
1 mumbai 2017-01-23   14  ramesh   mum a football 
2 delhi 2017-01-23   25  ramesh   mum a football 
3 delhi 2017-01-23   20  ramesh   ram a football 
4 mumbai 2017-02-22   34  ramesh   ram b football 
5 mumbai 2017-02-22   23  ramesh   mum b football 
6 delhi 2017-02-22   43  ramesh   mum b football 
7 delhi 2017-02-22   34  ramesh   ram b football 
8 pune 2017-03-03   10  mahesh   mah c basketball 
9 nagpur 2017-03-03   20  mahesh   mah c basketball 

    month_num previous_month_num 
0 2017-01   2016-12 
1 2017-01   2016-12 
2 2017-01   2016-12 
3 2017-01   2016-12 
4 2017-02   2017-01 
5 2017-02   2017-01 
6 2017-02   2017-01 
7 2017-02   2017-01 
8 2017-03   2017-02 
9 2017-03   2017-02 
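
As an aside, the same two key columns can also be computed without apply or relativedelta by converting the dates to monthly periods. This is only a sketch of an alternative, not what the session above does; the column names are the ones used above:

```python
import pandas as pd

df = pd.DataFrame({'month': ['2017-01-23', '2017-02-22', '2017-03-03']})
df['month'] = pd.to_datetime(df['month'])

# Convert each date to a monthly period. Subtracting 1 steps back one
# month and rolls over the year boundary (2017-01 -> 2016-12).
period = df['month'].dt.to_period('M')
df['month_num'] = period.astype(str)
df['previous_month_num'] = (period - 1).astype(str)
print(df)
```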

Then we can merge the DataFrame with itself, using the computed month values as merge keys:

In [12]: relevant_columns = ['city', 'person_symbol', 'sport_name'] 

In [13]: pd.merge(df, df,
    ...:          left_on=relevant_columns + ['previous_month_num'],
    ...:          right_on=relevant_columns + ['month_num'],
    ...:          how='left', suffixes=('', '_previous')
    ...:          )[list(df.columns) + ['person_count_previous']
    ...:          ].fillna(0).drop(['month_num', 'previous_month_num'], axis=1)
Out[13]: 
    city  month person_count person_name person_symbol sir sport_name \ 
0 mumbai 2017-01-23   10  ramesh   ram a football 
1 mumbai 2017-01-23   14  ramesh   mum a football 
2 delhi 2017-01-23   25  ramesh   mum a football 
3 delhi 2017-01-23   20  ramesh   ram a football 
4 mumbai 2017-02-22   34  ramesh   ram b football 
5 mumbai 2017-02-22   23  ramesh   mum b football 
6 delhi 2017-02-22   43  ramesh   mum b football 
7 delhi 2017-02-22   34  ramesh   ram b football 
8 pune 2017-03-03   10  mahesh   mah c basketball 
9 nagpur 2017-03-03   20  mahesh   mah c basketball 

    person_count_previous 
0      0 
1      0 
2      0 
3      0 
4     10 
5     14 
6     25 
7     20 
8      0 
9      0 

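A few remarks:

  • I used ['city', 'person_symbol', 'sport_name'] as the reference columns, but feel free to add more, depending on what you want to achieve.
  • The new column is named person_count_previous, but you can rename it to whatever suits you best.
  • By default, the column is NaN when there is no match for the previous count. I replaced those values with 0 using fillna.
  • I used drop to remove the "temporary" columns, but feel free to keep them.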
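
To tie these remarks together, here is a minimal sketch that wraps the steps in one function. The name add_previous_count and its parameters are hypothetical, it uses monthly periods rather than relativedelta, and it assumes there is at most one row per key combination and month (otherwise the merge would duplicate rows):

```python
import pandas as pd


def add_previous_count(df, keys, date_col='month', count_col='person_count'):
    """Return a copy of df with a 'previous_<count_col>' column added.

    Hypothetical helper for illustration, not part of the answer above.
    """
    out = df.copy()
    period = pd.to_datetime(out[date_col]).dt.to_period('M')
    out['_month_num'] = period.astype(str)
    out['_previous_month_num'] = (period - 1).astype(str)

    # Lookup table: each row's own month is renamed so that it lines up
    # with the *previous* month of the rows it should be attached to.
    lookup = out[keys + ['_month_num', count_col]].rename(
        columns={'_month_num': '_previous_month_num',
                 count_col: 'previous_' + count_col})

    merged = out.merge(lookup, on=keys + ['_previous_month_num'], how='left')
    # No match gives NaN (and a float column), so fill and cast back to int.
    merged['previous_' + count_col] = (
        merged['previous_' + count_col].fillna(0).astype(int))
    # Drop the temporary key columns.
    return merged.drop(['_month_num', '_previous_month_num'], axis=1)


df = pd.DataFrame({
    'city': ['mumbai', 'delhi', 'mumbai', 'delhi', 'pune'],
    'person_symbol': ['ram', 'mum', 'ram', 'mum', 'mah'],
    'sport_name': ['football', 'football', 'football', 'football',
                   'basketball'],
    'month': ['2017-01-23', '2017-01-23', '2017-02-22', '2017-02-22',
              '2017-03-03'],
    'person_count': [10, 25, 34, 43, 10],
})
result = add_previous_count(df, ['city', 'person_symbol', 'sport_name'])
print(result)
```

The merge returns a fresh 0..n-1 index, so set the index again afterwards if the original one matters.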
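@3kt - Thanks for your help. One question: can we do the same thing by week? If the data is available weekly, can we compute the previous week's count from it? – kit

@3kt - I got it working. Thanks – kit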
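
For readers with the same weekly question, the merge approach carries over by swapping the monthly key for a weekly one. The sketch below is an assumption about how that could look, not the code kit ended up with; the week_date column and the sample values are made up, and weeks run Monday to Sunday (the pandas 'W' period default):

```python
import pandas as pd

# Made-up weekly data; 2017-01-16 is deliberately missing.
df = pd.DataFrame({
    'city': ['mumbai', 'mumbai', 'mumbai'],
    'week_date': ['2017-01-02', '2017-01-09', '2017-01-23'],
    'person_count': [10, 34, 50],
})

# Weekly periods play the role of month_num / previous_month_num.
week = pd.to_datetime(df['week_date']).dt.to_period('W')
df['week_key'] = week.astype(str)
df['previous_week_key'] = (week - 1).astype(str)

# Each row's own week is renamed so it matches the previous-week key
# of the row one week later.
lookup = df[['city', 'week_key', 'person_count']].rename(
    columns={'week_key': 'previous_week_key',
             'person_count': 'previous_person_count'})

result = df.merge(lookup, on=['city', 'previous_week_key'], how='left')
result['previous_person_count'] = (
    result['previous_person_count'].fillna(0).astype(int))
print(result[['city', 'week_date', 'person_count', 'previous_person_count']])
```

Because the week before 2017-01-23 has no row, that record gets 0 rather than the count from two weeks earlier.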