2017-10-10 98 views

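How do I convert an RDD to a DataFrame in pyspark 1.6.1? Are there any examples of converting an RDD to a DataFrame and a DataFrame back to an RDD in pyspark 1.6.1? Is toDF() not available in 1.6.1?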

For example, I have an RDD like this:

data = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3),
         ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])

Answers


If for some reason you cannot use the .toDF() method, the solution I propose is this:

data = sqlContext.createDataFrame(sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3),
        ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]))

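This creates a DataFrame whose columns are named "_n", where n is the column number. If you want to rename the columns, I suggest looking at this post: How to change dataframe column names in pyspark?. However, all you need to do is: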

data_named = data.selectExpr("_1 as One", "_2 as Two", "_3 as Three", "_4 as Four", "_5 as Five") 

Now, let's look at the DataFrame:

data_named.show() 

which outputs:

+---+---+-----+----+----+
|One|Two|Three|Four|Five|
+---+---+-----+----+----+
|  a|  b|    c|   1|   4|
|  o|  u|    w|   9|   3|
|  s|  q|    a|   8|   6|
|  l|  g|    z|   8|   3|
|  a|  b|    c|   9|   8|
|  s|  q|    a|  10|  10|
|  l|  g|    z|  20|  20|
|  o|  u|    w|  77|  77|
+---+---+-----+----+----+
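
For a DataFrame with many columns, typing each "_n as Name" expression by hand gets tedious. A minimal sketch of generating them instead is below; the list `new_names` and the variable `rename_exprs` are illustrative names, not part of the original answer, and the commented-out last line assumes the `data` DataFrame created earlier.

```python
# Hypothetical target names, one per auto-generated column (_1 .. _5).
new_names = ["One", "Two", "Three", "Four", "Five"]

# Build "_1 as One", "_2 as Two", ...; auto-generated column names are 1-based.
rename_exprs = ["_{0} as {1}".format(i, name)
                for i, name in enumerate(new_names, start=1)]

print(rename_exprs)

# With a live DataFrame, the expressions would be applied like this:
# data_named = data.selectExpr(*rename_exprs)
```

The list is unpacked with `*` because selectExpr takes each expression as a separate argument.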

Edit: Try again, because you should be able to use .toDF() in Spark 1.6.1.


I don't see any reason why rdd.toDF() could not be used in pyspark with Spark 1.6.1. Please check the toDF() example in the Spark 1.6.1 Python documentation: https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext

As per your requirement:

rdd = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]) 

#rdd to dataframe 
df = rdd.toDF() 
## column names can be provided like df2 = df.toDF('col1', 'col2', 'col3', 'col4', 'col5')

#dataframe to rdd 
rdd2 = df.rdd 
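
If toDF() still raises an AttributeError, one possible cause (an assumption, not a diagnosis of the asker's setup) is that no SQLContext has been created yet: as far as I know, Spark 1.x attaches toDF() to RDDs only when a SQLContext is instantiated. The sketch below avoids toDF() altogether by wrapping both directions in two small helpers. The function names `rdd_to_named_df` and `df_to_tuple_rdd` are hypothetical; the calls they rely on, `createDataFrame(rdd, names)` and `df.rdd`, are the ones already used in this thread.

```python
def rdd_to_named_df(sql_context, rdd, column_names):
    """Build a DataFrame from an RDD of tuples, naming the columns up front."""
    # createDataFrame accepts a list of column names as its schema argument;
    # the column types are then inferred from the data.
    return sql_context.createDataFrame(rdd, list(column_names))


def df_to_tuple_rdd(df):
    """Convert a DataFrame back to an RDD of plain tuples."""
    # df.rdd yields Row objects; Row is a tuple subclass, so tuple() unwraps it.
    return df.rdd.map(tuple)
```

Usage would look like `df = rdd_to_named_df(sqlContext, rdd, ['One', 'Two', 'Three', 'Four', 'Five'])` followed by `rdd2 = df_to_tuple_rdd(df)`. Naming the columns at creation time also removes the need for a separate selectExpr rename.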