I want to keep only the employees whose DepartmentID appears in the second table. How do I filter based on another RDD in Spark?
Employee table

LastName    DepartmentID
Rafferty    31
Jones       33
Heisenberg  33
Robinson    34
Smith       34

Department table

DepartmentID
31
33
I tried the code below, but it doesn't work:
employee = [['Rafferty',31], ['Jones',33], ['Heisenberg',33], ['Robinson',34], ['Smith',34]]
department = [31,33]
employee = sc.parallelize(employee)
department = sc.parallelize(department)
employee.filter(lambda e: e[1] in department).collect()
Py4JError: An error occurred while calling o344.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
Any ideas? I'm using Spark 1.1.0 with Python, but I would accept an answer in either Scala or Python.
Do you need your department list to be an RDD? – maasg 2014-10-06 18:01:43
No. The department list is loaded from HDFS, but it isn't very large. – poiuytrez 2014-10-07 07:52:07