2015-10-16 129 views
3

PySpark - Convert an RDD into a key-value pair RDD, with the values in a list

I have an RDD whose tuples have the form:

[("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2"), ... 

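I want to transform it into a key-value pair RDD in which the first field is the first string (the key) and the second field is a list of the remaining strings (the values), i.e., I want to turn it into the following form: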

[("a1",["b1","c1","d1","e1"]), ("a2",["b2","c2","d2","e2"]), ... 

Answers

6
>>> rdd = sc.parallelize([("a1","b1","c1","d1","e1"), ("a2","b2","c2","d2","e2")])
>>> result = rdd.map(lambda x: (x[0], list(x[1:])))
>>> print(result.collect())
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b2', 'c2', 'd2', 'e2'])]

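Explanation of lambda x: (x[0], list(x[1:]))

  1. x[0] makes the first element of the input tuple the first element of the output
  2. x[1:] makes every element except the first the second element of the output
  3. list(x[1:]) forces that second element to be a list, because by default a slice of a tuple is a tuple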
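
The three steps above can be checked without a Spark cluster. This is only a minimal sketch in plain Python: the built-in map stands in for rdd.map, and the function name to_key_values is a hypothetical name chosen for illustration, not part of PySpark.

```python
def to_key_values(row):
    """Split a tuple into (first field, list of the remaining fields)."""
    # Hypothetical helper; it does the same work as the lambda above.
    return (row[0], list(row[1:]))


rows = [("a1", "b1", "c1", "d1", "e1"), ("a2", "b2", "c2", "d2", "e2")]

# The built-in map stands in for rdd.map here, so no SparkContext is needed.
pairs = list(map(to_key_values, rows))
print(pairs)

# Without list(), slicing a tuple yields another tuple.
print(type(rows[0][1:]).__name__)
```

Since rdd.map accepts any callable, the same function could presumably be passed as rdd.map(to_key_values) in place of the lambda.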

Exactly what I needed, thanks! – nikos