Group a Dask dataframe and generate aggregates
I have a Dask dataframe that looks like this:
url   referrer  session_id  ts                   customer
url1  ref1      xxx         2017-09-15 00:00:00  a.com
url2  ref2      yyy         2017-09-15 00:00:00  a.com
url2  ref3      yyy         2017-09-15 00:00:00  a.com
url1  ref1      xxx         2017-09-15 01:00:00  a.com
url2  ref2      yyy         2017-09-15 01:00:00  a.com
I want to group by URL and timestamp, aggregate the column values, and produce a dataframe that looks like this instead:
customer  url   ts                   page_views  visitors  referrers
a.com     url1  2017-09-15 00:00:00  1           1         [ref1]
a.com     url2  2017-09-15 00:00:00  2           2         [ref2, ref3]
In Spark SQL, I can do this as follows:
select
customer,
url,
ts,
count(*) as page_views,
count(distinct(session_id)) as visitors,
collect_list(referrer) as referrers
from df
group by customer, url, ts
Is there any way to do this with Dask dataframes? I tried, but I can only compute the aggregated columns separately, as follows:
# group on timestamp (rounded) and url
grouped = df.groupby(['ts', 'url'])
# calculate page views (count rows in each group)
page_views = grouped.size()
# collect a list of referrer strings per group (lists are stored as object dtype)
referrers = grouped['referrer'].apply(list, meta=('referrers', 'object'))
# count unique visitors (distinct session ids)
visitors = grouped['session_id'].nunique()
However, I can't seem to find a good way to produce the combined dataframe I need.
Is there a good way to do this in pandas? Does that approach work with dask.dataframe? – MRocklin
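For the pandas side of that question, here is a minimal sketch using named aggregation. The sample frame is rebuilt from the rows in the question, and the output column names follow the Spark SQL query. Whether the same call carries over unchanged to dask.dataframe is not established here; the list-collecting aggregation in particular may need a different approach there.

```python
import pandas as pd

# Sample data rebuilt from the rows shown in the question.
df = pd.DataFrame({
    "url": ["url1", "url2", "url2", "url1", "url2"],
    "referrer": ["ref1", "ref2", "ref3", "ref1", "ref2"],
    "session_id": ["xxx", "yyy", "yyy", "xxx", "yyy"],
    "ts": pd.to_datetime(
        ["2017-09-15 00:00:00"] * 3 + ["2017-09-15 01:00:00"] * 2
    ),
    "customer": ["a.com"] * 5,
})

# One groupby with several named aggregations, mirroring the SQL query:
#   count(*)                   -> page_views
#   count(distinct session_id) -> visitors
#   collect_list(referrer)     -> referrers
result = (
    df.groupby(["customer", "url", "ts"])
      .agg(
          page_views=("session_id", "size"),
          visitors=("session_id", "nunique"),
          referrers=("referrer", list),
      )
      .reset_index()
)

print(result)
```

With this sample, the url2 group at 00:00:00 has two page views and the referrers [ref2, ref3], but only one distinct session_id (yyy), so a distinct count gives visitors = 1 for that row.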