from pyspark import SparkContext
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np


APP_NAME = "DataFrameToCSV"

spark = SparkSession\
    .builder\
    .appName(APP_NAME)\
    .config("spark.sql.crossJoin.enabled", "true")\
    .getOrCreate()

group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]

dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05","2016-04-20","2016-04-20","2016-04-29"]

#event = [0,1,0,0,0,0,1,1,0,0,0,0,1,0]
event = [0,1,1,0,1,0,1,0,0,1,0,0,0,0]

# np.column_stack coerces all columns to a common dtype, so every column
# in the resulting pandas DataFrame holds strings.
dataFrameArr = np.column_stack((group_ids, dates, event))

df = pd.DataFrame(dataFrameArr, columns=["group_ids", "dates", "event"])

The above Python code runs on a Spark cluster on gcloud Dataproc. I want to save the pandas DataFrame as a CSV file in a gcloud storage bucket, at gs://mybucket/csv_data/.

How do I do this?

Answer


So, I figured out how to do this. Continuing from the code above, here is the solution:

sc = SparkContext.getOrCreate()

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)

# Convert the pandas DataFrame to a Spark DataFrame, then write it to the
# bucket. coalesce(1) collapses the data into a single partition, so the
# output directory gs://mybucket/csv_data/ holds one part-*.csv file.
sparkDf = sqlCtx.createDataFrame(df)
sparkDf.coalesce(1).write.option("header", "true").csv('gs://mybucket/csv_data')
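
Since the SparkSession created at the top is already available, the extra SparkContext/SQLContext indirection is not strictly needed on Spark 2.x. A minimal equivalent sketch, reusing the same `spark` session and the bucket path from the question:

# Assumes Spark 2.x, where SparkSession.createDataFrame accepts a pandas DataFrame.
sparkDf = spark.createDataFrame(df)
sparkDf.coalesce(1).write.option("header", "true").csv("gs://mybucket/csv_data")

This works on Dataproc without extra setup because the Cloud Storage connector is preinstalled on Dataproc clusters, so Spark can read and write gs:// paths directly.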