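Pandas: joining two DataFrames with a different time range per row
I want to count how many times a company appears in the news during the year leading up to its earnings date, and compare that with the counts for other companies over the same time frame. I have two pandas DataFrames, one with earnings dates and one with news. My approach is slow. Is there a better pandas/numpy way?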
import pandas as pd
companies = pd.DataFrame({'CompanyName': ['A', 'B', 'C'], 'EarningsDate': ['2013/01/15', '2015/03/25', '2017/05/03']})
companies['EarningsDate'] = pd.to_datetime(companies.EarningsDate)
news = pd.DataFrame({'CompanyName': ['A', 'A', 'A', 'B', 'B', 'C'],
                     'NewsDate': ['2012/02/01', '2013/01/10', '2015/05/13', '2012/05/23', '2013/01/03', '2017/05/01']})
news['NewsDate'] = pd.to_datetime(news.NewsDate)
companies
looks like:
  CompanyName EarningsDate
0           A   2013-01-15
1           B   2015-03-25
2           C   2017-05-03
news
looks like:
  CompanyName   NewsDate
0           A 2012-02-01
1           A 2013-01-10
2           A 2015-05-13
3           B 2012-05-23
4           B 2013-01-03
5           C 2017-05-01
How can I rewrite the code below? It works, but it is very slow when each DataFrame has more than 500k rows.
company_count = []
other_count = []
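for _, company in companies.iterrows():
    # One-year window ending at the earnings date (both ends exclusive)
    end_date = company.EarningsDate
    start_date = end_date - pd.DateOffset(years=1)
    subset = news[(news.NewsDate > start_date) & (news.NewsDate < end_date)]
    mask = subset.CompanyName == company.CompanyName
    # News items about this company inside the window
    company_count.append(subset[mask].shape[0])
    # Mean per-company count over the other companies seen in the window
    other_count.append(subset[~mask].groupby('CompanyName').size().mean())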
companies['12MonCompanyNewsCount'] = pd.Series(company_count)
companies['12MonOtherNewsCount'] = pd.Series(other_count).fillna(0)
The final result, companies, looks like:
  CompanyName EarningsDate  12MonCompanyNewsCount  12MonOtherNewsCount
0           A   2013-01-15                      2                    2
1           B   2015-03-25                      0                    0
2           C   2017-05-03                      1                    0
Try this: https://stackoverflow.com/questions/22391433/count-the-frequency-that-a-value-occurs-in-a-dataframe-column – RetardedJoker
'value_counts()' won't work here. I have to join the two DataFrames over a different window for each row in order to aggregate. –
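One possible way to avoid the row-by-row loop in the question is to loop once per distinct company in the news instead, and use `np.searchsorted` on that company's sorted dates to count its articles inside every earnings window at once. The sketch below is only an illustration, not a tested answer for the real data: the function name `count_news_in_windows` is hypothetical, and it assumes the number of distinct companies is much smaller than the number of rows, that there are no missing dates, and that the window is open on both ends, as in the question's loop.

```python
import numpy as np
import pandas as pd


def count_news_in_windows(companies, news):
    """Hypothetical helper: return a copy of `companies` with both count columns."""
    end = companies["EarningsDate"]
    start = end - pd.DateOffset(years=1)
    end_v = end.to_numpy()
    start_v = start.to_numpy()
    names = companies["CompanyName"].to_numpy()

    n = len(companies)
    own = np.zeros(n, dtype=np.int64)     # news about the row's own company
    total = np.zeros(n, dtype=np.int64)   # news about any company
    active = np.zeros(n, dtype=np.int64)  # companies with >= 1 item in the window

    # One pass per distinct company in `news`, not per earnings row.
    for name, dates in news.groupby("CompanyName")["NewsDate"]:
        d = np.sort(dates.to_numpy())
        # Items strictly inside (start, end) for every window at once:
        # (items < end) minus (items <= start).
        cnt = (np.searchsorted(d, end_v, side="left")
               - np.searchsorted(d, start_v, side="right"))
        own += np.where(names == name, cnt, 0)
        total += cnt
        active += cnt > 0

    # Mean count per *other* company that appears in the window, 0 if none.
    other_total = total - own
    other_active = active - (own > 0)
    other_mean = np.divide(other_total, other_active,
                           out=np.zeros(n), where=other_active > 0)

    out = companies.copy()
    out["12MonCompanyNewsCount"] = own
    out["12MonOtherNewsCount"] = other_mean
    return out
```

With the frames from the question this would be called as `count_news_in_windows(companies, news)`. The cost grows with the number of distinct companies times the number of earnings rows, so it should help when a few thousand companies are spread over 500k rows, and help much less if nearly every row is a different company.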