
Pyspark: change nested column data type

How can I change the data type of a nested column in Pyspark? For example, how can I change string values to int?

Reference: how to change a Dataframe column from String type to Double type in pyspark

{ 
    "x": "12", 
    "y": { 
     "p": { 
      "name": "abc", 
      "value": "10" 
     }, 
     "q": { 
      "name": "pqr", 
      "value": "20" 
     } 
    } 
} 
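For a flat column, the approach in the referenced question is a straightforward cast. A minimal sketch (not from the original post), assuming the JSON above has been loaded into a DataFrame df:

from pyspark.sql.functions import col

# casting a top-level column in place works directly
df = df.withColumn("x", col("x").cast("long"))

The same pattern does not modify a field nested inside the y struct in place, which is what this question is about.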

Does this change need to be persistent, with the changes saved back to the JSON file? Or do you just need the correct type while performing operations? – diek


@diek It needs to be written back to the JSON file. –

Answers

Answer 1 (score 2)

You can read the JSON data using:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
data_df = sqlContext.read.json("data.json", multiLine=True)

data_df.printSchema() 

Output

root 
|-- x: long (nullable = true) 
|-- y: struct (nullable = true) 
| |-- p: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true) 
| |-- q: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true) 

Now you can access the data from the columns inside y:

data_df.select("y.p.name") 
data_df.select("y.p.value") 

Output

abc, 10 
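Note that select returns a DataFrame rather than plain values; to actually print them you would typically chain .show(), for example:

data_df.select("y.p.name", "y.p.value").show()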

OK, so the solution is to add a new nested column with the correct schema and drop the column with the wrong schema:

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType

df3 = spark.read.json("data.json", multiLine=True)

# build the correct schema from the old one: copy the schema of "y",
# rename it to "z", and change both nested "value" fields to long
c = df3.schema['y'].jsonValue()
c['name'] = 'z'
c['type']['fields'][0]['type']['fields'][1]['type'] = 'long'
c['type']['fields'][1]['type']['fields'][1]['type'] = 'long'

y_schema = StructType.fromJson(c['type'])

# define a udf to populate the new column. Rows are immutable, so you
# have to build the new value from scratch.

def foo(row):
    d = row.asDict()
    y = {}
    y["p"] = {}
    y["p"]["name"] = d["p"]["name"]
    y["p"]["value"] = int(d["p"]["value"])
    y["q"] = {}
    y["q"]["name"] = d["q"]["name"]
    y["q"]["value"] = int(d["q"]["value"])

    return y

map_foo = udf(foo, y_schema)

# add the new column with the corrected schema
df3_new = df3.withColumn("z", map_foo("y"))

# drop the old column
df4 = df3_new.drop("y")


df4.printSchema() 

Output

root 
|-- x: long (nullable = true) 
|-- z: struct (nullable = true) 
| |-- p: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true) 
| |-- q: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true) 


df4.show() 

Output (the new nested column added, the old column removed):

+---+-------------------+
|  x|                  z|
+---+-------------------+
| 12|[[abc,10],[pqr,20]]|
+---+-------------------+
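As an aside beyond the original answer: on Spark 2.x the same conversion can also be done without a udf by rebuilding the struct with cast. A minimal sketch, assuming the df3 defined above:

from pyspark.sql.functions import col, struct

# rebuild "y" field by field, casting the nested "value" entries to long
df3_cast = df3.withColumn(
    "y",
    struct(
        struct(
            col("y.p.name").alias("name"),
            col("y.p.value").cast("long").alias("value"),
        ).alias("p"),
        struct(
            col("y.q.name").alias("name"),
            col("y.q.value").cast("long").alias("value"),
        ).alias("q"),
    ),
)
df3_cast.printSchema()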

@aswinids I edited the question. Any thoughts on this? –


@aswinids: Thanks for the help. Do we have decimal/timestamp data types in the JSON schema? –


@aswinids: If I change the value 10 to "10" and use type: 'long', then I get null –

Answer 2 (score 0)

Using arbitrary variable names may seem simple, but it is problematic and goes against PEP 8. And when dealing with numbers, I would suggest avoiding the common names used when iterating over these structures, i.e. value.

import json

with open('random.json') as json_file:
    data = json.load(json_file)

# convert each nested "value" entry under "y" from string to float
for k, v in data.items():
    if k == 'y':
        for key, item in v.items():
            item['value'] = float(item['value'])


print(type(data['y']['p']['value']))
print(type(data['y']['q']['value']))
# mac → python3 make_float.py
# <class 'float'>
# <class 'float'>

# write the converted data back to the same file
json_data = json.dumps(data, indent=4, sort_keys=True)
with open('random.json', 'w') as json_file:
    json_file.write(json_data)

The converted data is then written back out to the JSON file.
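As a quick sanity check (not part of the original answer), the rewritten file can be re-read with Spark to confirm that value is now inferred as a numeric type. A minimal sketch, assuming an active SparkSession named spark:

# after the float conversion, "value" should be inferred as double
df_check = spark.read.json("random.json", multiLine=True)
df_check.printSchema()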


The key part of this problem is that we generate around 60 GB of data every day, and we need to ensure scalability, which is why Spark is the way to go. –


Of course, this can't handle such a huge amount of data. Why doesn't the question you referenced work? Here is the example worked through from the documentation they gave: https://ghostbin.com/paste/wt5y6 – diek