2017-06-06 73 views
0

經過對原始數據的一些處理後,我得到了我的結果如下,它像一個鍵與多個值和值是字典值 - 我想作爲鍵+每個字典值PySpark - 如何分割多個字典值

[('HOMICIDE', {'2017': 1}), 
('DECEPTIVE PRACTICE', {'2015': 2, '2017': 2, '2016': 8}), 
('ROBBERY', {'2016': 2}), 
('OTHER OFFENSE', {'2016': 3}), 
('MOTOR VEHICLE THEFT', {'2017': 2, '2016': 2})] 

我怎樣才能讓上面

('HOMICIDE', '2017': 1), 
('DECEPTIVE PRACTICE', '2015': 2), 
('DECEPTIVE PRACTICE', '2017': 2,), 
('DECEPTIVE PRACTICE', '2016': 8), 
('ROBBERY', '2016': 2), 
('OTHER OFFENSE', '2016': 3), 
('MOTOR VEHICLE THEFT', '2017': 2) 
('MOTOR VEHICLE THEFT', '2016': 2)] 

我是否需要移動字典列表,並從那裏我需要處理?

回答

1

只是flatMapValues

In [1]: rdd = sc.parallelize([('HOMICIDE', {'2017': 1}), 
    ...: ('DECEPTIVE PRACTICE', {'2015': 2, '2017': 2, '2016': 8}), 
    ...: ('ROBBERY', {'2016': 2}), 
    ...: ('OTHER OFFENSE', {'2016': 3}), 
    ...: ('MOTOR VEHICLE THEFT', {'2017': 2, '2016': 2})]) 

In [4]: rdd.flatMapValues(dict.items) 
Out[4]: PythonRDD[5] at RDD at PythonRDD.scala:48 

In [5]: rdd.flatMapValues(dict.items).collect() 
Out[5]: 
[('HOMICIDE', ('2017', 1)), 
('DECEPTIVE PRACTICE', ('2015', 2)), 
('DECEPTIVE PRACTICE', ('2017', 2)), 
('DECEPTIVE PRACTICE', ('2016', 8)), 
('ROBBERY', ('2016', 2)), 
('OTHER OFFENSE', ('2016', 3)), 
('MOTOR VEHICLE THEFT', ('2017', 2)), 
('MOTOR VEHICLE THEFT', ('2016', 2))] 

或很長的路要走

In [5]: rdd.flatMap(lambda x: [(x[0], k, v) for k, v in x[1].items()]).collect() 
Out[5]:                   
[('HOMICIDE', '2017', 1), 
('DECEPTIVE PRACTICE', '2015', 2), 
('DECEPTIVE PRACTICE', '2017', 2), 
('DECEPTIVE PRACTICE', '2016', 8), 
('ROBBERY', '2016', 2), 
('OTHER OFFENSE', '2016', 3), 
('MOTOR VEHICLE THEFT', '2017', 2), 
('MOTOR VEHICLE THEFT', '2016', 2)] 
+0

我試圖執行 rdd.flatMapValues(dict.items).collect() 但以下錯誤 - 讓我再次檢查,是否還有其他問題 文件「/usr/local/anaconda/lib/python2.7/pickle.py」 ,線306,在保存 RV =減少(self.proto) 類型錯誤:無法鹹菜method_descriptor對象 –

+0

而不是rdd.flatMapValues(dict.items).collect()時,我給它得到了工作 波紋管> > rdd.flatMapValues(lambda data:data.items())。collect() –

0

波紋管得到了工作 - rdd.flatMapValues(拉姆達數據:data.items())。收集()

感謝@StackPointer和@ user8119199