2016-02-17 62 views
0

How do I use groupBy in SparkR to count while keeping the other columns as they are?

In SparkR I need to compute a count on the data frame df3:

>df3 <- select(df_1,"DC_NAME","STORE", "ITEM_DESC","ITEM") 

>head(df3) 
DC_NAME STORE     ITEM_DESC  ITEM 
1 Kerala 1216 Nambisan Ghee 200 Ml Pet Jar 100050222 
2 Kerala 1216 Nambisan Ghee 100 ml Pet Jar 100149022 
3 Kerala 1216 Nambisan Ghee 50 ml Pet Jar 100149024 
4 Kerala 1219 Nambisan Ghee 500 Ml Pet Jar 100050210 
5 Kerala 1219 Nambisan Ghee 200 Ml Pet Jar 100050222 
6 Kerala 1219 Nambisan Ghee 50 ml Pet Jar 100149024 

To count the number of times each STORE value occurs, I used this code:
df_3 <- groupBy(df3, "STORE") %>% count() 
STORE count 
1 1216  3 
2 1219  3 
3 3154  1 
4 3049  3 
5 1990  3 
6 3107  4 

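This gives the number of occurrences of each STORE value, but I need the result in the following form, including the columns 'DC_NAME, ITEM_DESC, ITEM'. Is there any code that does this?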

DC_NAME STORE     ITEM_DESC  ITEM count 
1 Kerala 1216 Nambisan Ghee 200 Ml Pet Jar 100050222 3 
2 Kerala 1216 Nambisan Ghee 100 ml Pet Jar 100149022 3 
3 Kerala 1216 Nambisan Ghee 50 ml Pet Jar 100149024 3 
4 Kerala 1219 Nambisan Ghee 500 Ml Pet Jar 100050210 3 
5 Kerala 1219 Nambisan Ghee 200 Ml Pet Jar 100050222 3 
6 Kerala 1219 Nambisan Ghee 50 ml Pet Jar 100149024 3 
+0

Just 'join' the aggregate with the input. Or use a window function with an unbounded frame. – zero323

+0

It can be done with 'join', but in R it can also be done with 'group_by'. Can something like that be used in SparkR? –

+0

No. 'join' or a window function are the only options. – zero323

Answers

0

If you want to avoid a join, you can use a window function with an unbounded frame. Assume your data has the following structure:

df <- structure(list(DC_NAME = structure(c(1L, 1L, 1L, 1L, 1L, 1L), 
    .Label = " Kerala ", class = "factor"), 
    STORE = c(1216L, 1216L, 1216L, 1219L, 1219L, 1219L), 
    ITEM_DESC = structure(c(2L, 
    1L, 4L, 3L, 2L, 4L), .Label = c(" Nambisan Ghee 100 ml Pet Jar", 
    " Nambisan Ghee 200 Ml Pet Jar", " Nambisan Ghee 500 Ml Pet Jar", 
    " Nambisan Ghee 50 ml Pet Jar"), class = "factor"), ITEM = c(100050222L, 
    100149022L, 100149024L, 100050210L, 100050222L, 100149024L 
    )), .Names = c("DC_NAME", "STORE", "ITEM_DESC", "ITEM"), 
    class = "data.frame", row.names = c("1 ", "2 ", "3 ", "4 ", "5 ", "6 ")) 
  • Create a Spark DataFrame using a Hive context (in Spark 1.x, window functions require a HiveContext):

    hiveContext <- sparkRHive.init(sc) 
    sdf <- createDataFrame(hiveContext, df) 
    
  • Register it as a temporary table:

    registerTempTable(sdf, "sdf") 
    
  • Prepare the query:

    query <- "SELECT *, SUM(1) OVER (
        PARTITION BY STORE 
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 
    ) AS count FROM sdf" 
    
  • Execute it with the sql function:

    sql(hiveContext, query) 
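
The 'join' alternative mentioned in the comments could look roughly like the sketch below. It assumes the SparkR 1.x style API (`sparkR.init`, `sparkRSQL.init`, `createDataFrame(sqlContext, ...)`); newer Spark versions initialize the session differently. The names `counts`, `STORE_KEY`, `joined`, and `result` are illustrative and do not come from the original post, and the sample data is a simplified copy of the rows shown in the question.

```r
library(SparkR)

# Assumption: SparkR 1.x style initialization.
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

# Small local sample with the same columns as in the question.
df <- data.frame(
  DC_NAME = rep("Kerala", 6),
  STORE = c(1216L, 1216L, 1216L, 1219L, 1219L, 1219L),
  ITEM_DESC = c("Nambisan Ghee 200 Ml Pet Jar", "Nambisan Ghee 100 ml Pet Jar",
                "Nambisan Ghee 50 ml Pet Jar", "Nambisan Ghee 500 Ml Pet Jar",
                "Nambisan Ghee 200 Ml Pet Jar", "Nambisan Ghee 50 ml Pet Jar"),
  ITEM = c(100050222L, 100149022L, 100149024L,
           100050210L, 100050222L, 100149024L),
  stringsAsFactors = FALSE
)
sdf <- createDataFrame(sqlContext, df)

# Count rows per STORE, then rename the key (hypothetical name STORE_KEY)
# so the join does not produce two columns called STORE.
counts <- count(groupBy(sdf, "STORE"))
counts <- withColumnRenamed(counts, "STORE", "STORE_KEY")

# Join the per-store counts back onto every original row.
joined <- join(sdf, counts, sdf$STORE == counts$STORE_KEY, "inner")
result <- select(joined, "DC_NAME", "STORE", "ITEM_DESC", "ITEM", "count")

head(result)
```

Row order after the join is not guaranteed, so sort the result explicitly if the order matters. Compared with the window-function query, this version does not need a Hive context, at the cost of an extra aggregation and join.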
    