
Scala - groupBy and max on a pair RDD

I am new to Spark and Scala, and I want to find the maximum salary in each department:

Dept,Salary 
Dept1,1000 
Dept2,2000 
Dept1,2500 
Dept2,1500 
Dept1,1700 
Dept2,2800 

I implemented the code below:

import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ 
import org.apache.spark.SparkConf 


object MaxSalary { 
    val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]")) 

    case class Dept(dept_name : String, Salary : Int) 

    val data = sc.textFile("file:///home/user/Documents/dept.txt").map(_.split(",")) 
    val recs = data.map(r => (r(0), Dept(r(0), r(1).toInt))) 
    val a = recs.max() // ??????? stuck here: how to get the max per department? 
} 

I am stuck on how to implement the group by and max functions. I am using a pair RDD.

Thanks

Answers


If you use a Dataset, here is the solution:

import org.apache.spark.{SparkConf, SparkContext} 
import org.apache.spark.sql.SQLContext 
import org.apache.spark.sql.functions.max 

case class Dept(dept_name: String, Salary: Int) 

val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]")) 
val sq = new SQLContext(sc) 

import sq.implicits._ // enables .toDS() on RDDs of case classes 

val file = "resources/ip.csv" 
val data = sc.textFile(file).map(_.split(",")) 

// Map each row to a Dept and lift the RDD into a Dataset 
val recs = data.map(r => Dept(r(0), r(1).toInt)).toDS() 

// Group by department and take the maximum salary per group 
recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show() 

Output:

+---------+------------+ 
|dept_name|max_solution| 
+---------+------------+ 
|    Dept2|        2800| 
|    Dept1|        2500| 
+---------+------------+ 

Getting the error 'value toDS is not a member of org.apache.spark.rdd.RDD[MaxSalary.Dept]' – Ajay


Did you use import spark.implicits._? –


It's not a member.. Do I need to write it, since it returns the error 'not found: value sqlContext'? – Ajay
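
Note: the errors in the comments above are what you see on Spark 2.x, where SparkSession replaces SQLContext and provides the implicits. A minimal sketch of the same Dataset approach, assuming Spark 2.x and the question's file path (the object and method names here are illustrative):

import org.apache.spark.sql.SparkSession 
import org.apache.spark.sql.functions.max 

object MaxSalaryDS { 
  // Defined outside main so Spark can derive an encoder for it 
  case class Dept(dept_name: String, Salary: Int) 

  def main(args: Array[String]): Unit = { 
    val spark = SparkSession.builder() 
      .appName("Max Salary") 
      .master("local[2]") 
      .getOrCreate() 
    import spark.implicits._ // brings toDS() and $"..." into scope 

    val recs = spark.sparkContext 
      .textFile("file:///home/user/Documents/dept.txt") 
      .mapPartitionsWithIndex((idx, rows) => if (idx == 0) rows.drop(1) else rows) // skip header 
      .map(_.split(",")) 
      .map(r => Dept(r(0), r(1).toInt)) 
      .toDS() 

    recs.groupBy($"dept_name").agg(max("Salary").alias("max_solution")).show() 

    spark.stop() 
  } 
} 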


If you don't want to create a DataFrame:

val emp = sc.textFile("file:///home/user/Documents/dept.txt") 
  .mapPartitionsWithIndex((idx, rows) => if (idx == 0) rows.drop(1) else rows) // skip the CSV header 
  .map(x => (x.split(",")(0), x.split(",")(1).toInt)) // pair RDD of (dept, salary) 
val maxSal = emp.reduceByKey(math.max(_, _)) // keep the maximum salary per key 

should give you:

Array[(String, Int)] = Array((Dept1,2500), (Dept2,2800))
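
If you need the whole row rather than just the salary figure, the same reduceByKey pattern works on the asker's original pair RDD of (String, Dept). A minimal sketch reusing the question's file path and case class (the object name is illustrative):

import org.apache.spark.{SparkConf, SparkContext} 

object MaxSalaryRecord { 
  case class Dept(dept_name: String, Salary: Int) 

  def main(args: Array[String]): Unit = { 
    val sc = new SparkContext(new SparkConf().setAppName("Max Salary").setMaster("local[2]")) 

    // Pair RDD keyed by department, carrying the full record as the value 
    val recs = sc.textFile("file:///home/user/Documents/dept.txt") 
      .mapPartitionsWithIndex((idx, rows) => if (idx == 0) rows.drop(1) else rows) // skip header 
      .map(_.split(",")) 
      .map(r => (r(0), Dept(r(0), r(1).toInt))) 

    // Keep, per key, whichever record has the higher Salary 
    val maxRec = recs.reduceByKey((a, b) => if (a.Salary >= b.Salary) a else b) 
    maxRec.collect().foreach(println) 
    // e.g. (Dept1,Dept(Dept1,2500)) and (Dept2,Dept(Dept2,2800)) 

    sc.stop() 
  } 
} 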