Splitting a DataFrame using Spark (Python)
I am using a DataFrame in Spark to split my data and store it in tabular format. The data in my file looks like this:
{"click_id": 123, "created_at": "2016-10-03T10:50:33", "product_id": 98373, "product_price": 220.50, "user_id": 1, "ip": "10.10.10.10"}
{"click_id": 124, "created_at": "2017-02-03T10:51:33", "product_id": 97373, "product_price": 320.50, "user_id": 1, "ip": "10.13.10.10"}
{"click_id": 125, "created_at": "2017-10-03T10:52:33", "product_id": 96373, "product_price": 20.50, "user_id": 1, "ip": "192.168.2.1"}
I have written this code to split the data:
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as psf
spark = SparkSession \
    .builder \
    .appName("Hello") \
    .config("World") \
    .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
ratings = spark.createDataFrame(
    sc.textFile("transactions.json").map(lambda l: l.split(',')),
    ["Col1","Col2","Col3","Col4","Col5","Col6"]
)
ratings.registerTempTable("ratings")
final_df = sqlContext.sql("select * from ratings")
final_df.show(20,False)
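The code above runs without errors, but in its output each column shows the key together with its value: click_id is displayed along with its number, and likewise created_at is displayed along with its timestamp.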
What I actually want in the table is only the values for click_id, created_at, product_id, and so on.
How can I store only these values in my table?
Do you mean removing the keys ('click_id', 'created_at', etc.) and keeping only the values for all 6 columns? – desertnaut
@desertnaut Yes – Firstname
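One possible approach, as a minimal sketch rather than a tested solution for this exact setup: the keys appear because `l.split(',')` cuts each raw line at the commas, so every cell still contains the `"key": value` text. Parsing each line as JSON first and keeping only the values avoids that. The names `COLUMNS`, `parse_line`, and `build_dataframe` are made up for illustration, and the sketch assumes every line is a complete JSON object containing all six keys.

```python
import json

# Column order for the table, taken from the sample data in the question.
COLUMNS = ["click_id", "created_at", "product_id",
           "product_price", "user_id", "ip"]


def parse_line(line):
    """Parse one JSON line and return only its values, in COLUMNS order."""
    record = json.loads(line)
    return tuple(record[name] for name in COLUMNS)


def build_dataframe(spark, path="transactions.json"):
    """Hypothetical helper: map lines through parse_line instead of split(',').

    `spark` is an existing SparkSession; this function is not called below.
    """
    rows = spark.sparkContext.textFile(path).map(parse_line)
    return spark.createDataFrame(rows, COLUMNS)


sample = ('{"click_id": 123, "created_at": "2016-10-03T10:50:33", '
          '"product_id": 98373, "product_price": 220.50, '
          '"user_id": 1, "ip": "10.10.10.10"}')

# Splitting on commas leaves the key text inside the cell.
print(sample.split(',')[0])   # {"click_id": 123

# Parsing as JSON keeps only the values.
print(parse_line(sample))     # (123, '2016-10-03T10:50:33', 98373, 220.5, 1, '10.10.10.10')
```

If the file is line-delimited JSON like the sample, `spark.read.json("transactions.json")` should also produce a DataFrame with one column per key, without any manual parsing.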