
pyspark: reading a csv with pandas, how to keep the header

I'm reading data from a csv using pandas' chunking feature. It works, except that I can't keep the header. Is there a way to do this? Here is the sample code:

import pyspark 
import pandas as pd 
sc = pyspark.SparkContext(appName="myAppName") 
spark_rdd = sc.emptyRDD() 

# filename: csv file 
chunks = pd.read_csv(filename, chunksize=10000) 
for chunk in chunks: 
    spark_rdd += sc.parallelize(chunk.values.tolist()) 

    #print(chunk.head()) 
    #print(spark_rdd.toDF().show()) 
    #break 

spark_df = spark_rdd.toDF() 
spark_df.show() 

Answers


Try this:

import pyspark 
import pandas as pd 
sc = pyspark.SparkContext(appName="myAppName") 
spark_rdd = sc.emptyRDD() 

# Read ten rows to get column names 
x = pd.read_csv(filename, nrows=10) 
mycolumns = list(x) 

# filename: csv file 
chunks = pd.read_csv(filename, chunksize=10000) 
for chunk in chunks: 
    spark_rdd += sc.parallelize(chunk.values.tolist()) 

spark_df = spark_rdd.map(lambda x: tuple(x)).toDF(mycolumns) 
spark_df.show() 
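
A small variation on the same idea, sketched under the same assumptions (same filename and SparkContext as above): since pandas gives every chunk the same column names, they can be taken from the first chunk itself, which avoids the extra read_csv call.

import pyspark 
import pandas as pd 

sc = pyspark.SparkContext(appName="myAppName") 
spark_rdd = sc.emptyRDD() 
mycolumns = None 

chunks = pd.read_csv(filename, chunksize=10000) 
for chunk in chunks: 
    if mycolumns is None: 
        # every chunk is a DataFrame carrying the csv header, 
        # so the column names are available on the first chunk 
        mycolumns = chunk.columns.tolist() 
    spark_rdd += sc.parallelize(chunk.values.tolist()) 

spark_df = spark_rdd.map(lambda x: tuple(x)).toDF(mycolumns) 
spark_df.show() 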

To read the header, shouldn't 'x = pd.read_csv(filename, nrows=1)' be enough? – muon


I agree it's somewhat arbitrary; as long as you take at least one row, whether it is 1, 5 or 10 rows hardly matters. –
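
Following up on that comment, a minimal sketch (not part of the original answer, and header_only is just an illustrative variable name): passing nrows=0 parses only the header line, so no data rows are loaded at all.

# nrows=0 returns an empty DataFrame whose columns come from the csv header 
header_only = pd.read_csv(filename, nrows=0) 
mycolumns = header_only.columns.tolist() 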


I ended up using Databricks' spark-csv:

sc = pyspark.SparkContext() 
sql = pyspark.SQLContext(sc) 

df = sql.read.load(filename, 
       format='com.databricks.spark.csv', 
       header='true', 
       inferSchema='true')
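
For what it's worth, on Spark 2.x and later the csv source is built in, so the external databricks package is no longer needed. A minimal sketch, assuming a SparkSession is used in place of SQLContext:

from pyspark.sql import SparkSession 

spark = SparkSession.builder.appName("myAppName").getOrCreate() 

# header=True keeps the first line as column names, 
# inferSchema=True guesses column types from the data 
df = spark.read.csv(filename, header=True, inferSchema=True) 
df.show() 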