How do I define an empty DataFrame in PySpark and append other DataFrames to it?

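I want to read the CSV files in a directory as PySpark DataFrames and then append them into a single DataFrame. I could not find a PySpark equivalent of the way we do this in pandas. How do I define an empty DataFrame in PySpark and append the corresponding DataFrames to it?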

For example, in pandas we do this:

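import glob
import pandas as pd

files = glob.glob(path + '*.csv')

df = pd.DataFrame()

for f in files:
    dff = pd.read_csv(f, delimiter=',')
    # The result must be assigned back; DataFrame.append was removed in pandas 2.0
    df = pd.concat([df, dff])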

In PySpark I tried this, but it did not work:

schema=StructType([]) 
union_df = sqlContext.createDataFrame(sc.emptyRDD(),schema) 

for f in files: 
    dff = sqlContext.read.load(f,format='com.databricks.spark.csv',header='true',inferSchema='true',delimiter=',') 
    df=df.union_All(dff)

Any help would be appreciated.

Thanks

Source

2017-04-10 Gaurav Chawla

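The schemas must be the same when you use unionAll on two DataFrames. The schema of the empty DataFrame should therefore match the schema of the CSV files.

For example: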

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("v1", LongType(), True),
    StructField("v2", StringType(), False),
    StructField("v3", StringType(), False)
])
df = sqlContext.createDataFrame([], schema)
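Because unionAll matches columns by position, it can help to compare the column names and types of each DataFrame before combining them. The following is only a sketch: `assert_same_columns` is a hypothetical helper, not part of PySpark, and it assumes each object exposes `dtypes` as a list of (column name, type string) tuples, as PySpark DataFrames do.

```python
def assert_same_columns(frames):
    """Raise ValueError unless every frame reports the same (name, type) pairs.

    Hypothetical helper: it only reads the `dtypes` attribute, which PySpark
    DataFrames expose as a list of (column name, type string) tuples.
    """
    frames = list(frames)
    if not frames:
        return  # nothing to compare
    expected = list(frames[0].dtypes)
    for position, frame in enumerate(frames[1:], start=1):
        actual = list(frame.dtypes)
        if actual != expected:
            raise ValueError(
                "frame %d has columns %s, expected %s" % (position, actual, expected)
            )
```

Calling `assert_same_columns([df, other_df])` before the union would surface a mismatch early; `other_df` is a placeholder name here.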

Or you can do this:

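import glob

files = glob.glob(path + '*.csv')

for idx, f in enumerate(files):
    # Read each CSV with a header row and an inferred schema
    df = spark.read.csv(f, header=True, inferSchema=True)
    if idx == 0:
        dff = df  # the first file seeds the combined DataFrame
    else:
        dff = dff.unionAll(df)  # later files are appended; columns match by position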
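The same loop can also be written as a fold, which avoids the special case for the first file. This is a minimal sketch: `union_all` is a hypothetical helper name, and it assumes only that each object has a `union` method, as PySpark DataFrames do from Spark 2.0 onward.

```python
from functools import reduce


def union_all(frames):
    """Fold an iterable of DataFrames into one by chaining .union().

    Hypothetical helper; raises ValueError on empty input because there is
    no schema from which to build an empty result.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame")
    return reduce(lambda left, right: left.union(right), frames)
```

With the variables from the answer above, this would be called as `union_all(spark.read.csv(f, header=True, inferSchema=True) for f in files)`. In Spark 2.x, `spark.read.csv` also accepts a list of paths, so `spark.read.csv(files, header=True, inferSchema=True)` may make the union unnecessary.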

Source

2017-04-10 08:14:52

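Here is one way: create an empty DataFrame, then use unionAll to concatenate new DataFrames onto it, or even run a loop to combine a bunch of DataFrames together.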

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

sc = SparkContext(conf=SparkConf())
spark = SparkSession(sc)  # Need to use SparkSession(sc) to createDataFrame

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True)
])
empty = spark.createDataFrame(sc.emptyRDD(), schema)

# addOndata is an existing DataFrame with the same two string columns
empty = empty.unionAll(addOndata)

Source

2017-04-10 08:53:12

First, define the schema:

I got this working as follows with Spark 2.1:

Source

2017-10-23 22:18:54
