2015-09-21 74 views
2

PySpark: many features to a LabeledPoint RDD

I am new to Spark, and all the examples I have read deal with small datasets, such as:

from pyspark.mllib.regression import LabeledPoint

RDD = sc.parallelize([
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
])

However, I have a large dataset with more than 50 features. An example row:

u'2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5' 

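I want to quickly create a LabeledPoint RDD in PySpark. I tried to index the last position as the first data point of the LabeledPoint, and then index the first n-1 positions as the dense vector. However, I get the following error. Any guidance is appreciated! Note: if I change [] to () when creating the LabeledPoint, I get an "invalid syntax" error.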

df = myDataRDD.map(lambda line: line.split(',')) 
data = [ 
    LabeledPoint(df[54], df[0:53]) 
] 
TypeError: 'PipelinedRDD' object does not support indexing 
--------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-67-fa1b56e8441e> in <module>() 
     2 df = myDataRDD.map(lambda line: line.split(',')) 
     3 data = [ 
----> 4  LabeledPoint(df[54], df[0:53]) 
     5 ] 

TypeError: 'PipelinedRDD' object does not support indexing 
+0

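To clarify: when you mention the last position as the first data point, do you mean using it as the label and the remaining elements as the features of the LabeledPoint class?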

Answers

4

As your error states, you cannot access an RDD by index. You need a second map statement to transform the sequences into LabeledPoints:

rows = [u'2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5', u'2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5'] 

rows_rdd = sc.parallelize(rows) # create RDD with given rows 

from pyspark.mllib.regression import LabeledPoint

# split each row, then use the last item as the label and the rest as features
labeled_points_rdd = (rows_rdd
        .map(lambda row: row.split(','))
        .map(lambda seq: LabeledPoint(float(seq[-1]), [float(x) for x in seq[:-1]])))

print(labeled_points_rdd.take(2))
# prints [LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,...]),
#         LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,...])]
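The two map steps can also be folded into one named function, which is easier to check outside Spark. The following is only a minimal sketch: `parse_line` is a hypothetical helper name, and the short row is made up for illustration.

```python
def parse_line(line):
    """Split a comma-separated row into (label, features), all as floats.

    Assumes the label is the last field and every field is numeric.
    """
    values = [float(x) for x in line.split(',')]
    return values[-1], values[:-1]


# A shortened, made-up row: four features followed by the label.
label, features = parse_line(u'2596,51,3,258,5')
print(label)
print(features)
```

With Spark available, this could presumably be used as `rows_rdd.map(parse_line).map(lambda lf: LabeledPoint(lf[0], lf[1]))`, assuming `LabeledPoint` is imported from `pyspark.mllib.regression`.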

Note that Python's negative indices let you access a sequence from the end.
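To make the slicing concrete, here is a small plain-Python sketch (no Spark needed) on a shortened, made-up row. The variable names are illustrative only. It also shows why the feature slice should stop at `-1` rather than `-2`.

```python
# A shortened, made-up row: four feature values followed by the label.
seq = u'2596,51,3,258,5'.split(',')

label = seq[-1]        # the last item
features = seq[:-1]    # everything except the last item
too_short = seq[:-2]   # drops the last TWO items, losing one feature

print(label)
print(features)
print(too_short)
```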

With .take(n) you then get the first n elements from your RDD.

Hope this helps.

2

You cannot use indexing; instead, you have to use the methods available in the Spark API. So:

data = [ LabeledPoint(myDataRDD.take(RDD.count()),   # Last element
         myDataRDD.top(RDD.count()-1)) ]              # All but last

(Untested, but this is the general idea.)

+0

Thanks for the help. I believe this only works row-wise, correct? How would I do this column-wise? – adlopez15