我有一個數據幀什麼:只填寫一個數據幀丟失的值(熊貓)
email user_name sessions ymo
[email protected] JD 1 2015-03-01
[email protected] JD 2 2015-05-01
我需要什麼:
email user_name sessions ymo
[email protected] JD 0 2015-01-01
[email protected] JD 0 2015-02-01
[email protected] JD 1 2015-03-01
[email protected] JD 0 2015-04-01
[email protected] JD 2 2015-05-01
[email protected] JD 0 2015-06-01
[email protected] JD 0 2015-07-01
[email protected] JD 0 2015-08-01
[email protected] JD 0 2015-09-01
[email protected] JD 0 2015-10-01
[email protected] JD 0 2015-11-01
[email protected] JD 0 2015-12-01
ymo
列是pd.Timestamp
S:
all_ymo
[Timestamp('2015-01-01 00:00:00'),
Timestamp('2015-02-01 00:00:00'),
Timestamp('2015-03-01 00:00:00'),
Timestamp('2015-04-01 00:00:00'),
Timestamp('2015-05-01 00:00:00'),
Timestamp('2015-06-01 00:00:00'),
Timestamp('2015-07-01 00:00:00'),
Timestamp('2015-08-01 00:00:00'),
Timestamp('2015-09-01 00:00:00'),
Timestamp('2015-10-01 00:00:00'),
Timestamp('2015-11-01 00:00:00'),
Timestamp('2015-12-01 00:00:00')]
不幸的是,這個答案:Adding values for missing data combinations in Pandas不好,因爲它會爲現有的ymo
值。
我想這樣的事情,但它是非常緩慢:
for em in all_emails:
existent_ymo = fill_ymo[fill_ymo['email'] == em]['ymo']
existent_ymo = set([pd.Timestamp(datetime.date(t.year, t.month, t.day)) for t in existent_ymo])
missing_ymo = list(existent_ymo - all_ymo)
multi_ind = pd.MultiIndex.from_product([[em], missing_ymo], names=col_names)
fill_ymo = sessions.set_index(col_names).reindex(multi_ind, fill_value=0).reset_index()
如果丟失的條目數量超過填充,然後填充開始用pd.data_range一個新的數據幀。然後在日期匹配的地方添加會話值。如果電子郵件地址和用戶名是1對1,那麼考慮只在數據幀中包含其中一個以節省內存(如果大小是個問題) – dodell