Fill in only the missing values of a DataFrame (pandas)

What I have:

email user_name sessions ymo 
[email protected] JD 1 2015-03-01 
[email protected] JD 2 2015-05-01

What I need:

email user_name sessions ymo 
[email protected] JD 0 2015-01-01 
[email protected] JD 0 2015-02-01 
[email protected] JD 1 2015-03-01 
[email protected] JD 0 2015-04-01 
[email protected] JD 2 2015-05-01 
[email protected] JD 0 2015-06-01 
[email protected] JD 0 2015-07-01 
[email protected] JD 0 2015-08-01 
[email protected] JD 0 2015-09-01 
[email protected] JD 0 2015-10-01 
[email protected] JD 0 2015-11-01 
[email protected] JD 0 2015-12-01

The ymo column holds pd.Timestamp values:

all_ymo 

[Timestamp('2015-01-01 00:00:00'), 
Timestamp('2015-02-01 00:00:00'), 
Timestamp('2015-03-01 00:00:00'), 
Timestamp('2015-04-01 00:00:00'), 
Timestamp('2015-05-01 00:00:00'), 
Timestamp('2015-06-01 00:00:00'), 
Timestamp('2015-07-01 00:00:00'), 
Timestamp('2015-08-01 00:00:00'), 
Timestamp('2015-09-01 00:00:00'), 
Timestamp('2015-10-01 00:00:00'), 
Timestamp('2015-11-01 00:00:00'), 
Timestamp('2015-12-01 00:00:00')]

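Unfortunately, this answer: Adding values for missing data combinations in Pandas is no good, because it also produces rows for the ymo values that already exist.

I tried something like this, but it is very slow: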

for em in all_emails:
    # ymo values already present for this email
    existent_ymo = fill_ymo[fill_ymo['email'] == em]['ymo']
    existent_ymo = set(pd.Timestamp(datetime.date(t.year, t.month, t.day)) for t in existent_ymo)
    # missing months: everything in all_ymo minus what already exists
    missing_ymo = list(set(all_ymo) - existent_ymo)
    multi_ind = pd.MultiIndex.from_product([[em], missing_ymo], names=col_names)
    fill_ymo = sessions.set_index(col_names).reindex(multi_ind, fill_value=0).reset_index()

Source

2016-09-13 LetMeSOThat4U

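If the missing entries outnumber the filled ones, start from a new DataFrame built with pd.date_range, then add the session values where the dates match. If email address and user name map one-to-one, consider keeping only one of them in the DataFrame to save memory (if size is a concern). – dodell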
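The suggestion in this comment could be sketched roughly as follows. This is only an illustration, not dodell's own code: the e-mail addresses are placeholders, the names `grid` and `filled` are made up for the example, and `how="cross"` assumes pandas 1.2 or newer.

```python
import pandas as pd

# Hypothetical sample data mirroring the question (address is a placeholder).
df = pd.DataFrame({
    "email": ["jd@example.com", "jd@example.com"],
    "user_name": ["JD", "JD"],
    "sessions": [1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01"]),
})

# Every month start of 2015.
all_ymo = pd.date_range("2015-01-01", periods=12, freq="MS")

# One row per (user, month): cross join of the distinct users with the months.
users = df[["email", "user_name"]].drop_duplicates()
grid = users.merge(pd.DataFrame({"ymo": all_ymo}), how="cross")

# Attach the known session counts; months without a match become 0.
filled = grid.merge(df, on=["email", "user_name", "ymo"], how="left")
filled["sessions"] = filled["sessions"].fillna(0).astype(int)

print(filled)
```

Because the full grid is built first and the existing rows are joined onto it, existing ymo values are matched rather than re-created, and no Python-level loop over e-mail addresses is needed.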

I tried to create a more general solution using periods:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 


mbeg = pd.period_range('2015-01', periods=12, freq='M') 
print (mbeg) 
PeriodIndex(['2015-01', '2015-02', '2015-03', '2015-04', '2015-05', '2015-06', 
      '2015-07', '2015-08', '2015-09', '2015-10', '2015-11', '2015-12'], 
      dtype='int64', freq='M') 
# convert column ymo to monthly periods
df.ymo = df.ymo.dt.to_period('M')
# group by user, reindex each group to all 12 months, filling gaps with 0
# (parentheses are needed so the method chain can span several lines)
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          .drop(['email','user_name'], axis=1, errors='ignore'))
        .rename_axis(('email','user_name','ymo'))
        .reset_index())

print (df) 

     email user_name  ymo sessions 
0 [email protected]  JD 2015-01   0 
1 [email protected]  JD 2015-02   0 
2 [email protected]  JD 2015-03   1 
3 [email protected]  JD 2015-04   0 
4 [email protected]  JD 2015-05   2 
5 [email protected]  JD 2015-06   0 
6 [email protected]  JD 2015-07   0 
7 [email protected]  JD 2015-08   0 
8 [email protected]  JD 2015-09   0 
9 [email protected]  JD 2015-10   0 
10 [email protected]  JD 2015-11   0 
11 [email protected]  JD 2015-12   0 
12 [email protected]  AB 2015-01   0 
13 [email protected]  AB 2015-02   0 
14 [email protected]  AB 2015-03   1 
15 [email protected]  AB 2015-04   0 
16 [email protected]  AB 2015-05   2 
17 [email protected]  AB 2015-06   0 
18 [email protected]  AB 2015-07   0 
19 [email protected]  AB 2015-08   0 
20 [email protected]  AB 2015-09   0 
21 [email protected]  AB 2015-10   0 
22 [email protected]  AB 2015-11   0 
23 [email protected]  AB 2015-12   0

Then, if you need datetimes, use to_timestamp:

df.ymo = df.ymo.dt.to_timestamp() 
print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0

Solution with datetimes:

print (df) 
    email user_name sessions  ymo 
0 [email protected]  JD   1 2015-03-01 
1 [email protected]  JD   2 2015-05-01 
2 [email protected]  AB   1 2015-03-01 
3 [email protected]  AB   2 2015-05-01 

# month-start dates for 2015 ('MS' = month start)
mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')

# group by user and reindex each group to all 12 month starts, filling gaps with 0
df = (df.groupby(['email','user_name'])
        .apply(lambda x: x.set_index('ymo')
                          .reindex(mbeg, fill_value=0)
                          .drop(['email','user_name'], axis=1, errors='ignore'))
        .rename_axis(('email','user_name','ymo'))
        .reset_index())

print (df) 
     email user_name  ymo sessions 
0 [email protected]  JD 2015-01-01   0 
1 [email protected]  JD 2015-02-01   0 
2 [email protected]  JD 2015-03-01   1 
3 [email protected]  JD 2015-04-01   0 
4 [email protected]  JD 2015-05-01   2 
5 [email protected]  JD 2015-06-01   0 
6 [email protected]  JD 2015-07-01   0 
7 [email protected]  JD 2015-08-01   0 
8 [email protected]  JD 2015-09-01   0 
9 [email protected]  JD 2015-10-01   0 
10 [email protected]  JD 2015-11-01   0 
11 [email protected]  JD 2015-12-01   0 
12 [email protected]  AB 2015-01-01   0 
13 [email protected]  AB 2015-02-01   0 
14 [email protected]  AB 2015-03-01   1 
15 [email protected]  AB 2015-04-01   0 
16 [email protected]  AB 2015-05-01   2 
17 [email protected]  AB 2015-06-01   0 
18 [email protected]  AB 2015-07-01   0 
19 [email protected]  AB 2015-08-01   0 
20 [email protected]  AB 2015-09-01   0 
21 [email protected]  AB 2015-10-01   0 
22 [email protected]  AB 2015-11-01   0 
23 [email protected]  AB 2015-12-01   0

Source

2016-09-13 10:25:06 jezrael

Generate the month-start dates and reindex
ffill and bfill the columns ['email', 'user_name']
fillna(0) the column 'sessions'

mbeg = pd.date_range('2015-01-01', periods=12, freq='MS')  # month-start dates

df1 = df.set_index('ymo').reindex(mbeg) 

df1[['email', 'user_name']] = df1[['email', 'user_name']].ffill().bfill() 
df1['sessions'] = df1['sessions'].fillna(0).astype(int) 

df1

Source

2016-09-13 09:53:39 piRSquared

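Unfortunately, this does not work if there are rows for other users with the same date (as in df in jezrael's answer); it raises "ValueError: cannot reindex from a duplicate axis". – LetMeSOThat4U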
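One possible way around the duplicate-axis error, shown here only as a sketch and not taken from any of the answers, is to index on all three key columns so that every label is unique, and then reindex against the full set of (email, user_name, ymo) combinations. The sample addresses are placeholders and the names `full_index` and `out` are made up for the illustration.

```python
import pandas as pd

# Hypothetical two-user data; the e-mail addresses are placeholders.
df = pd.DataFrame({
    "email": ["jd@example.com", "jd@example.com",
              "ab@example.com", "ab@example.com"],
    "user_name": ["JD", "JD", "AB", "AB"],
    "sessions": [1, 2, 1, 2],
    "ymo": pd.to_datetime(["2015-03-01", "2015-05-01",
                           "2015-03-01", "2015-05-01"]),
})

mbeg = pd.date_range("2015-01-01", periods=12, freq="MS")

# Full index: each existing (email, user_name) pair combined with each month.
users = df[["email", "user_name"]].drop_duplicates()
full_index = pd.MultiIndex.from_tuples(
    [(e, u, m) for e, u in users.itertuples(index=False) for m in mbeg],
    names=["email", "user_name", "ymo"],
)

# The three-level index is unique even when users share dates,
# so reindex does not hit duplicate labels.
out = (
    df.set_index(["email", "user_name", "ymo"])
      .reindex(full_index, fill_value=0)
      .reset_index()
)

print(out)
```

Building the index from the distinct user pairs, rather than from a product of all e-mails and all user names, avoids creating rows for e-mail and user-name combinations that never occur in the data.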
