Spark request max count

I'm a beginner, and I'm trying to write a query that retrieves the most visited web page.

My query is the following:

mostPopularWebPageDF = logDF.groupBy("webPage").agg(functions.count("webPage").alias("cntWebPage")).agg(functions.max("cntWebPage")).show()

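With this query I only get a DataFrame containing the maximum count, but I want a DataFrame that contains both that count and the web page it belongs to.

Something like: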

webPage   max(cntWebPage) 
google.com   2

How can I solve my problem?

Thanks a lot.


2016-11-26 JackR

In pyspark + SQL:

logDF.registerTempTable("logDF") 

mostPopularWebPageDF = sqlContext.sql("""select webPage, cntWebPage from (
              select webPage, count(*) as cntWebPage, max(count(*)) over() as maxcnt 
              from logDF 
              group by webPage) as tmp 
              where tmp.cntWebPage = tmp.maxcnt""")

It could probably be made cleaner, but it works. I'll try to optimize it.

My result:

webPage  cntWebPage 
google.com 2

The dataset:

webPage usersid 
google.com 1 
google.com 3 
bing.com 10

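Explanation: the ordinary counts are computed with GROUP BY + COUNT(*). The maximum of all those counts is computed with a window function, so for the dataset above the intermediate DataFrame (before the maxCount column is dropped) is: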

webPage count maxCount 
google.com 2  2 
bing.com 1  2

Then we select the rows whose count equals maxCount.
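
To make the three steps above concrete, here is a minimal sketch of the same logic in plain Python, without Spark. The `rows` list is a hypothetical stand-in for the sample dataset, and names such as `counts`, `max_count` and `most_popular` are illustrative only; they are not part of the original query.

```python
from collections import Counter

# Hypothetical rows mirroring the sample dataset: (webPage, usersid)
rows = [("google.com", 1), ("google.com", 3), ("bing.com", 10)]

# Step 1: count the rows per webPage (the GROUP BY + COUNT(*) part)
counts = Counter(web_page for web_page, _ in rows)

# Step 2: take the maximum over all counts (the MAX(COUNT(*)) OVER () part)
max_count = max(counts.values())

# Step 3: keep only the pages whose count equals the maximum
most_popular = sorted(
    (web_page, cnt) for web_page, cnt in counts.items() if cnt == max_count
)

print(most_popular)  # [('google.com', 2)]
```

As with the SQL version, if several pages tie for the highest count, all of them are kept, because the filter compares each count against the maximum instead of picking a single row.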

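Edit: I have removed the DSL version. It did not support the window over() and the sorting was changing the result. Sorry for the mistake. The SQL version is correct.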


2016-11-26 12:34:30

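Thank you very much for your help :) – JackR

@JackR If it helped you, please upvote and mark it as accepted :) –

I upvoted this because the OP seems to have no idea how to handle things. :) – eliasah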
