Working with RDDs

2015-05-21 37 views
0

I have an RDD:

[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0', 
u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1', 
u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1', 
u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0'] 

With this code:

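# Group the lines by the year stored in field 9.
rdd = rdd.groupBy(lambda x: x.split(",")[9]) 
# Collect the groups to the driver and turn each one into its own RDD.
new_rdds = [sc.parallelize(x[1]) for x in rdd.collect()] 

# Print the contents of each per-year RDD.
for x in new_rdds: 
    print(x.collect()) 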

[u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0'] 
[u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1', 
 u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1'] 
[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0'] 

Is there a way to get only one specific RDD, for example the one where x[9] == 2014?

So that I can get:

[u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0']

Answers

1

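You can filter the input RDD directly, for example rdd.filter(lambda x: x.split(",")[9] == "2014"). The fields produced by split() are strings, so compare against "2014" rather than the integer 2014.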
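To see why the quotes matter, here is a minimal sketch in plain Python (no Spark needed) that applies both versions of the predicate to the sample lines from the question. The function names `is_year_int` and `is_year_str` are made up for illustration; the same functions could presumably be passed to rdd.filter().

```python
# Sample lines copied from the question; field 9 holds the year.
lines = [
    u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
    u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
    u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
    u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0',
]

def is_year_int(line):
    # Hypothetical helper: compares a string field with an int,
    # which is never equal in Python, so nothing matches.
    return line.split(",")[9] == 2014

def is_year_str(line):
    # Hypothetical helper: compares a string with a string,
    # so the 2014 row matches.
    return line.split(",")[9] == "2014"

no_match = [line for line in lines if is_year_int(line)]
match = [line for line in lines if is_year_str(line)]

print(no_match)
print(match)
```

The integer comparison silently returns an empty result instead of raising an error, which makes this mistake easy to miss.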

1

You can use filter() to select specific rows.

Starting from your original RDD:

# split() yields strings, so compare field 9 against the string "2014".
rdd = rdd.filter(lambda line: line.split(",")[9] == "2014")
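
If you need this kind of selection for several years or columns, one option is a small predicate factory. The sketch below is only an illustration: `field_equals` is a hypothetical helper, not part of Spark, and it is exercised here with Python's built-in filter() on the sample lines so it runs without a cluster. It converts the expected value to a string, so passing 2014 or "2014" behaves the same.

```python
def field_equals(index, value, sep=","):
    """Return a predicate that is True when the field at `index` equals `value`.

    Hypothetical helper for illustration; not a Spark API.
    """
    expected = str(value)  # normalise so 2014 and "2014" behave the same

    def predicate(line):
        fields = line.split(sep)
        # Guard against short lines before indexing.
        return index < len(fields) and fields[index].strip() == expected

    return predicate

# Sample lines copied from the question.
lines = [
    u'1,0,0,0,0,0,0,0,1,2013,52,0,4,1,0',
    u'1,0,0,0,1,1,0,1,1,2012,49,1,1,0,1',
    u'1,0,0,0,1,1,0,0,1,2012,49,1,1,0,1',
    u'0,1,0,0,0,0,1,1,1,2014,45,0,0,1,0',
]

is_2014 = field_equals(9, 2014)
selected = list(filter(is_2014, lines))
print(selected)

# With Spark, the same predicate would be used as: rdd.filter(is_2014)
```

This filters the original RDD in one pass, so there is no need to group by year, collect everything to the driver, and build a separate RDD per group first.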