2017-03-18 154 views
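I have an RDD of (city, person_id, number) tuples, and for each city I want to find the person with the highest number. My first idea was to use reduceByKey with the city as the key (rdd.reduce((num1, num2) => Math.max(num1, num2))), but I don't know how to keep the person_id in the process.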

Answer

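You need to convert your RDD into a pair RDD keyed by city. Then you can call reduceByKey and, for each city, keep the person with the largest number: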

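// Key each record by city so reduceByKey compares people within the same city.
rdd.map { case (city, person_id, number) => (city, (person_id, number)) }.
  reduceByKey {
    // Keep whichever (person_id, number) pair has the larger number.
    case ((person_id1, n1), (person_id2, n2)) =>
      if (n1 > n2)
        (person_id1, n1)
      else
        (person_id2, n2)
  }.map {
    // Drop the number and keep only the winning person_id for each city.
    case (city, (person_id, _)) => (city, person_id)
  }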
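
If you want to check the reduce logic without a Spark context, here is a minimal sketch on plain Scala collections. It is not the Spark API: groupBy followed by reduce stands in for reduceByKey, and the object name MaxPerCity, the helper keepMax, and the sample records are all made up for illustration. The element types (String city, Int ids and numbers) are also an assumption, since the question does not state them.

```scala
object MaxPerCity {
  // Hypothetical sample data in the question's shape: (city, person_id, number).
  val records: Seq[(String, Int, Int)] = Seq(
    ("Paris", 1, 10),
    ("Paris", 2, 30),
    ("Berlin", 3, 5),
    ("Berlin", 4, 2)
  )

  // Same combine rule as the function passed to reduceByKey above:
  // keep the (person_id, number) pair with the larger number.
  def keepMax(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    if (a._2 > b._2) a else b

  // Local stand-in for map + reduceByKey + map on an RDD.
  def topPersonPerCity(rs: Seq[(String, Int, Int)]): Map[String, Int] =
    rs.map { case (city, personId, number) => (city, (personId, number)) }
      .groupBy(_._1)
      .map { case (city, pairs) => city -> pairs.map(_._2).reduce(keepMax)._1 }

  def main(args: Array[String]): Unit =
    println(topPersonPerCity(records))
}
```

With the sample data, Paris maps to person 2 and Berlin to person 3. If two people in a city tie on number, the else branch means the right-hand pair wins, and which pair that is depends on the order in which pairs are combined; in Spark that order is not guaranteed, so add an explicit tie-break if you need a deterministic result.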