Pyspark：更改嵌套列數據類型

如何更改Pyspark中嵌套列的數據類型？舉例來說，我如何將字符串的值的數據類型更改爲int？Pyspark：更改嵌套列數據類型

參考：how to change a Dataframe column from String type to Double type in pyspark

{ 
    "x": "12", 
    "y": { 
     "p": { 
      "name": "abc", 
      "value": "10" 
     }, 
     "q": { 
      "name": "pqr", 
      "value": "20" 
     } 
    } 
}

來源

2017-08-22 J.D

1.請問這種變化需要是持久的，有改動保存至JSON文件？或者你在進行手術時是否需要精確度？ – diek

@diek需要白色書寫json文件 –

可以使用讀取JSON數據

from pyspark import SQLContext 

sqlContext = SQLContext(sc) 
data_df = sqlContext.read.json("data.json", multiLine = True) 

data_df.printSchema()

輸出

root 
|-- x: long (nullable = true) 
|-- y: struct (nullable = true) 
| |-- p: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true) 
| |-- q: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true)

現在你可以從y中列存取數據

data_df.select("y.p.name") 
data_df.select("y.p.value")

輸出

abc, 10

好了，解決的辦法是用正確的模式與錯誤的架構

from pyspark.sql.functions import * 
from pyspark.sql import Row 

df3 = spark.read.json("data.json", multiLine = True) 

# create correct schema from old 
c = df3.schema['y'].jsonValue() 
c['name'] = 'z' 
c['type']['fields'][0]['type']['fields'][1]['type'] = 'long' 
c['type']['fields'][1]['type']['fields'][1]['type'] = 'long' 

y_schema = StructType.fromJson(c['type']) 

# define a udf to populate the new column. Row are immuatable so you 
# have to build it from start. 

def foo(row): 
    d = Row.asDict(row) 
    y = {} 
    y["p"] = {} 
    y["p"]["name"] = d["p"]["name"] 
    y["p"]["value"] = int(d["p"]["value"]) 
    y["q"] = {} 
    y["q"]["name"] = d["q"]["name"] 
    y["q"]["value"] = int(d["p"]["value"]) 

    return(y) 
map_foo = udf(foo, y_schema) 

# add the column 
df3_new = df3.withColumn("z", map_foo("y")) 

# delete the column 
df4 = df3_new.drop("y") 


df4.printSchema()

輸出

root 
|-- x: long (nullable = true) 
|-- z: struct (nullable = true) 
| |-- p: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true) 
| |-- q: struct (nullable = true) 
| | |-- name: string (nullable = true) 
| | |-- value: long (nullable = true) 


df4.show()

輸出

添加新的嵌套列，除去列

+---+-------------------+ 
| x|     z| 
+---+-------------------+ 
| 12|[[abc,10],[pqr,10]]| 
+---+-------------------+

來源

2017-08-23 13:49:12 ashwinids

@aswinids我編輯了這個問題。對此有任何想法？ –

@aswinids：感謝您的幫助。我們在json模式中有decima/timestamp數據類型嗎？ –

@aswinids：如果我將10的值更改爲「10」並使用type：'long'，那麼我會得到null –

使用任意變量名似乎很簡單，但這是有問題的並且與PEP8相反。而在處理數字時，我建議避免在迭代這些結構時使用的常用名稱，即值。

import json 

with open('random.json') as json_file: 
    data = json.load(json_file) 

for k, v in data.items(): 
    if k == 'y': 
     for key, item in v.items(): 
      item['value'] = float(item['value']) 


print(type(data['y']['p']['value'])) 
print(type(data['y']['q']['value'])) 
# mac → python3 make_float.py 
# <class 'float'> 
# <class 'float'> 
json_data = json.dumps(data, indent=4, sort_keys=True) 
with open('random.json', 'w') as json_file: 
    json_file.write(json_data)

來源

2017-08-24 01:55:13 diek

這個問題的關鍵部分是我們每天都會產生大約60GB的數據，我們需要確保可擴展性，這就是爲什麼Spark是出路的原因 –

當然，這無法處理如此龐大的數據量。爲什麼你引用的問題不起作用？從他們給出的文檔處理這個例子： https://ghostbin.com/paste/wt5y6 – diek

Pyspark：更改嵌套列數據類型

回答

相關問題