2016-09-13 116 views
1

Fill in only the missing values of a DataFrame (pandas)

What I have is a DataFrame:

email user_name sessions ymo 
[email protected] JD 1 2015-03-01 
[email protected] JD 2 2015-05-01 

What I need:

email user_name sessions ymo 
[email protected] JD 0 2015-01-01 
[email protected] JD 0 2015-02-01 
[email protected] JD 1 2015-03-01 
[email protected] JD 0 2015-04-01 
[email protected] JD 2 2015-05-01 
[email protected] JD 0 2015-06-01 
[email protected] JD 0 2015-07-01 
[email protected] JD 0 2015-08-01 
[email protected] JD 0 2015-09-01 
[email protected] JD 0 2015-10-01 
[email protected] JD 0 2015-11-01 
[email protected] JD 0 2015-12-01 

The ymo column holds pd.Timestamp values, and all_ymo is the full list of months:

all_ymo 

[Timestamp('2015-01-01 00:00:00'), 
Timestamp('2015-02-01 00:00:00'), 
Timestamp('2015-03-01 00:00:00'), 
Timestamp('2015-04-01 00:00:00'), 
Timestamp('2015-05-01 00:00:00'), 
Timestamp('2015-06-01 00:00:00'), 
Timestamp('2015-07-01 00:00:00'), 
Timestamp('2015-08-01 00:00:00'), 
Timestamp('2015-09-01 00:00:00'), 
Timestamp('2015-10-01 00:00:00'), 
Timestamp('2015-11-01 00:00:00'), 
Timestamp('2015-12-01 00:00:00')] 

Unfortunately, this answer: Adding values for missing data combinations in Pandas does not work for me, because it also generates rows for the ymo values that already exist.

I tried something like this, but it is very slow:

all_ymo_set = set(all_ymo)  # all_ymo is a list, so convert it once for set arithmetic
for em in all_emails:
    existent_ymo = fill_ymo[fill_ymo['email'] == em]['ymo']
    existent_ymo = set(pd.Timestamp(datetime.date(t.year, t.month, t.day)) for t in existent_ymo)
    # months in the full list that this email does not have yet
    missing_ymo = list(all_ymo_set - existent_ymo)
    multi_ind = pd.MultiIndex.from_product([[em], missing_ymo], names=col_names)
    fill_ymo = sessions.set_index(col_names).reindex(multi_ind, fill_value=0).reset_index()

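If the missing entries outnumber the existing ones, start by building a new DataFrame with pd.date_range, then add the session values where the dates match. If email address and user name map 1-to-1, consider keeping only one of them in the DataFrame to save memory (if size is an issue). – dodell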
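The suggestion in the comment above can be sketched roughly as follows. This is only one possible reading of it, not the commenter's own code: build every (user, month) combination first, then left-merge the known sessions onto that grid. The example.com addresses and the names `users`, `months`, `grid` and `result` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data in the shape described in the question.
df = pd.DataFrame({
    "email": ["jd@example.com", "jd@example.com", "ab@example.com", "ab@example.com"],
    "user_name": ["JD", "JD", "AB", "AB"],
    "sessions": [1, 2, 1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01", "2015-03-01", "2015-05-01"]),
})

# All month-start dates of 2015.
all_ymo = pd.date_range("2015-01-01", periods=12, freq="MS")

# One row per distinct user, one row per month, then every combination of both.
users = df[["email", "user_name"]].drop_duplicates()
months = pd.DataFrame({"ymo": all_ymo})
grid = users.merge(months, how="cross")

# Attach the known session counts; combinations without a match become NaN, then 0.
result = grid.merge(df, on=["email", "user_name", "ymo"], how="left")
result["sessions"] = result["sessions"].fillna(0).astype(int)
result = result[["email", "user_name", "sessions", "ymo"]]
```

Because the existing rows are matched rather than appended, each (email, ymo) pair appears exactly once in `result`. Whether this is faster than the loop depends on the data and was not measured here.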

Answers

2

I tried to create a more general solution using periods:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 


mbeg = pd.period_range('2015-01', periods=12, freq='M') 
print (mbeg) 
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04', '2015-05', '2015-06', 
      '2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12'], 
      dtype='int64', freq='M') 
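# convert column ymo to monthly periods
df.ymo = df.ymo.dt.to_period('M')
# group by user, reindex each group to all 12 months and fill missing sessions with 0
# (the chain is wrapped in parentheses so it can span several lines)
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          # errors='ignore': some pandas versions no longer pass the group keys
                          .drop(['email','user_name'], axis=1, errors='ignore'))
        .rename_axis(['email','user_name','ymo'])
        .reset_index())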
print (df) 

     email user_name  ymo sessions 
0 [email protected]  JD 2015-01   0 
1 [email protected]  JD 2015-02   0 
2 [email protected]  JD 2015-03   1 
3 [email protected]  JD 2015-04   0 
4 [email protected]  JD 2015-05   2 
5 [email protected]  JD 2015-06   0 
6 [email protected]  JD 2015-07   0 
7 [email protected]  JD 2015-08   0 
8 [email protected]  JD 2015-09   0 
9 [email protected]  JD 2015-10   0 
10 [email protected]  JD 2015-11   0 
11 [email protected]  JD 2015-12   0 
12 [email protected]  AB 2015-01   0 
13 [email protected]  AB 2015-02   0 
14 [email protected]  AB 2015-03   1 
15 [email protected]  AB 2015-04   0 
16 [email protected]  AB 2015-05   2 
17 [email protected]  AB 2015-06   0 
18 [email protected]  AB 2015-07   0 
19 [email protected]  AB 2015-08   0 
20 [email protected]  AB 2015-09   0 
21 [email protected]  AB 2015-10   0 
22 [email protected]  AB 2015-11   0 
23 [email protected]  AB 2015-12   0 

Then, if you need datetimes, use to_timestamp:

df.ymo = df.ymo.dt.to_timestamp() 
print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0 
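As a small illustration of the two conversions used above: `to_period('M')` maps every date within a month to the same monthly period, and `to_timestamp()` maps a period back to the first day of that month. The dates below are made up, and reading this as the reason the period route is "more general" is an assumption, not something the answer states.

```python
import pandas as pd

# Hypothetical dates that do not all fall on the first of the month.
dates = pd.Series(pd.to_datetime(["2015-03-01", "2015-03-17", "2015-05-31"]))

# Every date collapses to the monthly period that contains it.
periods = dates.dt.to_period("M")

# Converting back yields the start of each period, i.e. the first day of the month.
starts = periods.dt.to_timestamp()
```

With data like this, rows dated mid-month would still line up with a monthly PeriodIndex when reindexing, whereas a month-start DatetimeIndex would only match rows dated exactly on the first.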

Solution with datetimes:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 

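# month-start dates for all of 2015 ('MS' = month start)
mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')

# group by user, reindex each group to all 12 months and fill missing sessions with 0
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          # errors='ignore': some pandas versions no longer pass the group keys
                          .drop(['email','user_name'], axis=1, errors='ignore'))
        .rename_axis(['email','user_name','ymo'])
        .reset_index())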
print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0 
2
  • generate the month-start dates and reindex
  • ffill and bfill the ['email', 'user_name'] columns
  • fillna(0) on 'sessions'

mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')  # month-start dates for 2015

df1 = df.set_index('ymo').reindex(mbeg) 

df1[['email', 'user_name']] = df1[['email', 'user_name']].ffill().bfill() 
df1['sessions'] = df1['sessions'].fillna(0).astype(int) 

df1 


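Unfortunately, this does not work if there are rows for other users with the same dates (as in df in jezrael's answer); it raises "ValueError: cannot reindex from a duplicate axis". – LetMeSOThat4U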
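The limitation described in this comment can be reproduced with two users that share dates, and one possible workaround is to run the same reindex / ffill / bfill / fillna steps once per user and concatenate the pieces. This is a sketch under assumptions, not part of either answer: the example.com addresses and the helper name `fill_one_user` are hypothetical, and the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

# Hypothetical two-user data, shaped like df in the first answer.
df = pd.DataFrame({
    "email": ["jd@example.com", "jd@example.com", "ab@example.com", "ab@example.com"],
    "user_name": ["JD", "JD", "AB", "AB"],
    "sessions": [1, 2, 1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01", "2015-03-01", "2015-05-01"]),
})
mbeg = pd.date_range("2015-01-01", periods=12, freq="MS")

# With two users the 'ymo' index holds duplicate dates, so a plain reindex fails.
try:
    df.set_index("ymo").reindex(mbeg)
    reindex_failed = False
except ValueError:
    reindex_failed = True

def fill_one_user(user_rows):
    """Apply the reindex / ffill / bfill / fillna steps to the rows of one user."""
    out = user_rows.set_index("ymo").reindex(mbeg)
    out[["email", "user_name"]] = out[["email", "user_name"]].ffill().bfill()
    out["sessions"] = out["sessions"].fillna(0).astype(int)
    return out.rename_axis("ymo").reset_index()

# Within a single user the dates are unique, so the reindex succeeds per group.
filled = pd.concat(
    [fill_one_user(rows) for _, rows in df.groupby("email", sort=False)],
    ignore_index=True,
)
```

This still assumes that each user has at most one row per month; duplicate months within one user would raise the same error.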