
I am doing some basic work with Spark using Scala. Why doesn't the count function work with mapValues in Spark?

I would like to know why the count function does not work with the mapValues and map functions, while sum, min, and max work fine. Also, is there anywhere I can look up all the functions applicable to the Iterable[String] values of my groupbykeyRDD?

My code:

scala> val records = List("CHN|2", "CHN|3" , "BNG|2","BNG|65") 
records: List[String] = List(CHN|2, CHN|3, BNG|2, BNG|65) 

scala> val recordsRDD = sc.parallelize(records) 
recordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[119] at parallelize at <console>:23 

scala> val mapRDD = recordsRDD.map(elem => elem.split("\\|")) 
mapRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[120] at map at <console>:25 

scala> val keyvalueRDD = mapRDD.map(elem => (elem(0),elem(1))) 
keyvalueRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at map at <console>:27 

scala> val groupbykeyRDD = keyvalueRDD.groupByKey() 
groupbykeyRDD: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[122] at groupByKey at <console>:29 

scala> groupbykeyRDD.mapValues(elem => elem.count).collect 
<console>:32: error: missing arguments for method count in trait TraversableOnce; 
follow this method with `_' if you want to treat it as a partially applied function 
      groupbykeyRDD.mapValues(elem => elem.count).collect 
              ^

scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count)).collect 
<console>:32: error: missing arguments for method count in trait TraversableOnce; 
follow this method with `_' if you want to treat it as a partially applied function 
      groupbykeyRDD.map(elem => (elem._1 ,elem._2.count)).collect 

Expected output:

Array((CHN,2) ,(BNG,2)) 

Answers


The error you are getting has nothing to do with Spark; it is a pure Scala compilation error.

You can try it in a Scala console (no Spark at all):

scala> val iterableTest: Iterable[String] = Iterable("test") 
iterableTest: Iterable[String] = List(test) 

scala> iterableTest.count 
<console>:29: error: missing argument list for method count in trait TraversableOnce 

That is because Iterable does not define a count (with no arguments) method. It does define a count method, but that one takes a predicate function as its argument, which is why you get that specific error about a partially applied function.
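For example, supplying a predicate compiles and runs fine (a small sketch continuing the same console session; the res number is illustrative):

scala> iterableTest.count(_.nonEmpty) 
res1: Int = 1 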

It does have a size method, though, which you can swap into your example to make it work.
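A minimal sketch of that fix on the grouped RDD (the output matches the expected result above; element order may vary):

scala> groupbykeyRDD.mapValues(elem => elem.size).collect 
res2: Array[(String, Int)] = Array((CHN,2), (BNG,2)) 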


I applied the size method based on your answer and it worked.


The type of elem you get is Iterable[String], so try the size method, since Iterable has no zero-argument count method (and no length method either; length is defined on Seq). If you need length, you can convert the Iterable[String] to a List first and call length on that.

A zero-argument count method is available on RDDs.
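A short sketch of both suggestions (res numbers illustrative): converting the Iterable to a List in order to use length, and calling count directly on an RDD, where it takes no arguments and returns the number of elements as a Long:

scala> groupbykeyRDD.map(elem => (elem._1, elem._2.toList.length)).collect 
res3: Array[(String, Int)] = Array((CHN,2), (BNG,2)) 

scala> recordsRDD.count() 
res4: Long = 4 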


count - counts the occurrences of values satisfying the condition (a Boolean-returning function) supplied as its argument.

count with your code: here it counts the number of occurrences of "2" and "3":

scala> groupbykeyRDD.collect().foreach(println) 
(CHN,CompactBuffer(2, 3)) 
(BNG,CompactBuffer(2, 65)) 

scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == "2"))).collect 
res14: Array[(String, Int)] = Array((CHN,1), (BNG,1)) 

scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == "3"))).collect 
res15: Array[(String, Int)] = Array((CHN,1), (BNG,0)) 

count with a small fix to your code: if you tweak your code like this, count should give you the expected result:

val keyvalueRDD = mapRDD.map(elem => (elem(0),1))

Test:

scala> val groupbykeyRDD = mapRDD.map(elem => (elem(0),1)).groupByKey() 
groupbykeyRDD: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[9] at groupByKey at <console>:18 

scala> groupbykeyRDD.collect().foreach(println) 
(CHN,CompactBuffer(1, 1)) 
(BNG,CompactBuffer(1, 1)) 

scala> groupbykeyRDD.map(elem => (elem._1 ,elem._2.count(_ == 1))).collect 
res18: Array[(String, Int)] = Array((CHN,2), (BNG,2)) 

Great explanation! From your answer I understand that we can use count by passing it a function that returns a Boolean. Thanks a lot.


@SurenderRaja - cool - feel free to upvote! :)
