Using CombinePerKey in Google Cloud Dataflow Python

I'm trying to run a simple Dataflow Python pipeline that reads certain user events from BigQuery and produces per-user event counts.

p = df.Pipeline(argv=pipeline_args)
result_query = "..."
data = p | df.io.Read(df.io.BigQuerySource(query=result_query))
user_events = data | df.Map(lambda x: (x['users_user_id'], 1))
user_event_counts = user_events | df.CombinePerKey(sum)

Running this gives me an error before the CombinePerKey step:

TypeError: Expected tuple, got int [while running 'Map(<lambda at user_stats.py:...>)']

After the Map step, the data has this form:

(u'55107178236374', 1) 
(u'55107178236374', 1) 
(u'55107178236374', 1) 
(u'2296845644499670', 1) 
(u'2296845644499670', 1) 
(u'1489727796186326', 1) 
(u'1489727796186326', 1) 
(u'1489727796186326', 1) 
(u'1489727796186326', 1)

If I instead compute user_event_counts like this:

user_event_counts = (user_events | df.GroupByKey() |
    df.Map('count', lambda (user, ones): (user, sum(ones))))

then there are no errors and I get the result I expect.

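Based on the docs I would expect similar behavior from both approaches. I'm obviously missing something about CombinePerKey, but I can't see what it is. Any hints appreciated!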
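
For reference, the expectation that both approaches agree can be sketched in plain Python. This is only a model of the semantics, not the Dataflow SDK: group_by_key and combine_per_key are hypothetical helper names introduced for illustration, and the sample pairs are the ones listed above.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Model of GroupByKey: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_per_key(pairs, combine_fn):
    """Model of CombinePerKey: apply combine_fn to each key's values."""
    return {key: combine_fn(values)
            for key, values in group_by_key(pairs).items()}


# The (user_id, 1) pairs produced by the Map step in the question.
user_events = (
    [("55107178236374", 1)] * 3
    + [("2296845644499670", 1)] * 2
    + [("1489727796186326", 1)] * 4
)

# Approach 1: combine directly with sum.
via_combine = combine_per_key(user_events, sum)

# Approach 2: group first, then sum each group's ones.
via_group = {user: sum(ones)
             for user, ones in group_by_key(user_events).items()}

print(via_combine == via_group)
```

In this model the two results are identical, which matches the expectation that CombinePerKey(sum) and GroupByKey followed by a summing Map should produce the same counts.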

2016-05-16 numentar

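I'm guessing you are running an SDK version lower than 0.2.4. This was a bug in how we handled combine operations in some cases. The issue is fixed in the latest version of the SDK (v0.2.4): https://github.com/GoogleCloudPlatform/DataflowPythonSDK/releases/tag/v0.2.4 Sorry about that. If you still run into problems with the latest version, please let us know.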
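
As a rough sketch, checking a version string against the 0.2.4 threshold could look like the following. The helpers parse_version and is_affected are hypothetical names, not part of the SDK. The code assumes a plain dotted numeric version string and does not show how to look up the installed SDK version.

```python
def parse_version(version):
    """Turn a dotted version string such as '0.2.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# First release that contains the fix, per the answer above.
FIXED_IN = parse_version("0.2.4")


def is_affected(version):
    """Return True if the given SDK version predates the fix."""
    return parse_version(version) < FIXED_IN


for candidate in ("0.2.3", "0.2.4"):
    print(candidate, is_affected(candidate))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as treating "0.10.0" as older than "0.2.4".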

2016-05-17 16:26:01 Silviu
