I am using Spark 2.0.1 and Python 2.7 to modify and flatten some nested JSON data in PySpark. The raw data looks like this:
{
  "created": "28-12-2001T12:02:01.143",
  "class": "Class_A",
  "sub_class": "SubClass_B",
  "properties": {
    "meta": "some-info",
    ...,
    "interests": {"key1": "value1", "key2": "value2", ..., "keyN": "valueN"}
  }
}
Using withColumn and a udf, I was able to flatten the raw_data into a DataFrame that looks like this:
---------------------------------------------------------------------------------------------------------
| created                 | class   | sub_class  | meta        | interests                               |
---------------------------------------------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | {"key1": "value1", ..., "keyN": "valueN"} |
---------------------------------------------------------------------------------------------------------
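For context, the flattening step can be sketched in plain Python (in Spark this logic would sit inside the udf); the record and field names below just mirror the sample JSON, so this is illustrative rather than my exact udf:

```python
import json

# Sample record matching the JSON shape shown above (illustrative data).
raw = json.dumps({
    "created": "28-12-2001T12:02:01.143",
    "class": "Class_A",
    "sub_class": "SubClass_B",
    "properties": {
        "meta": "some-info",
        "interests": {"key1": "value1", "key2": "value2"},
    },
})

def flatten(raw_json):
    """Pull the nested 'properties' fields up to the top level."""
    rec = json.loads(raw_json)
    props = rec.pop("properties", {})
    rec["meta"] = props.get("meta")
    rec["interests"] = props.get("interests", {})
    return rec

row = flatten(raw)
print(row["meta"])  # some-info
```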
Now I want to transform the raw data (in JSON format) by splitting this single row into multiple rows, one per entry in the interests column. How can I do that?
Desired output:
-------------------------------------------------------------------------------
| created                 | class   | sub_class  | meta        | key  | value  |
-------------------------------------------------------------------------------
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key1 | value1 |
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | key2 | value2 |
| 28-12-2001T12:02:01.143 | Class_A | SubClass_B | 'some-info' | keyN | valueN |
-------------------------------------------------------------------------------
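To make the intended transformation concrete: this is the behavior of `pyspark.sql.functions.explode` applied to a MapType column (it emits one row per map entry, with `key` and `value` columns), assuming interests can first be parsed into a map rather than a string. A plain-Python sketch of the row-splitting, with the flattened row shape from the table above (names are illustrative):

```python
def split_interests(row):
    """Emit one output row per (key, value) pair in the interests map,
    copying all other columns onto each new row."""
    base = {k: v for k, v in row.items() if k != "interests"}
    return [dict(base, key=k, value=v) for k, v in row["interests"].items()]

# Illustrative flattened row, matching the table above.
flat = {
    "created": "28-12-2001T12:02:01.143",
    "class": "Class_A",
    "sub_class": "SubClass_B",
    "meta": "some-info",
    "interests": {"key1": "value1", "key2": "value2"},
}

rows = split_interests(flat)  # two rows, one per interests entry
```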
Thanks.