2016-07-31

I have a CSV file stored in HDFS (hdfs://localhost:54310, running locally on Windows) under the path /tmp/home/. I want to load this file from HDFS into a Spark DataFrame, so I tried this:

val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()

and then:

val path = "hdfs://localhost:54310/tmp/home/mycsv.csv" 
import spark.implicits._ 

spark.sqlContext.read 
    .format("com.databricks.spark.csv") 
    .option("header", "true") 
    .option("inferSchema", "true") 
    .load(path) 
    .show() 

But it fails at runtime with the exception stack trace below:

Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse 
at org.apache.hadoop.fs.Path.initialize(Path.java:205) 
at org.apache.hadoop.fs.Path.<init>(Path.java:171) 
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114) 
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145) 
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89) 
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95) 
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95) 
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112) 
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112) 
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111) 
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) 
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) 
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382) 
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143) 
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132) 

C:/test/sampleApp/ is the path where my sample project lives. But I have specified the HDFS path.
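For what it's worth, the exception originates in Hadoop's `Path` constructor, which assembles a `java.net.URI` from a scheme and a path separately. A Windows drive-letter path such as `C:/test/sampleApp/spark-warehouse` is *relative* in URI terms (it does not start with `/`), so pairing it with the `file:` scheme is rejected. A minimal JVM-only sketch of the same failure (the path literal here is just illustrative):

```scala
import java.net.{URI, URISyntaxException}

object RelativePathDemo extends App {
  // Mirrors what org.apache.hadoop.fs.Path.initialize does internally:
  // new URI(scheme, authority, path, query, fragment) with a path that
  // lacks a leading "/" fails with "Relative path in absolute URI".
  try {
    new URI("file", null, "C:/test/sampleApp/spark-warehouse", null, null)
    println("no exception")
  } catch {
    case e: URISyntaxException =>
      println("URISyntaxException: " + e.getMessage)
  }

  // Prefixing the drive letter with "/" makes the path absolute and valid.
  val ok = new URI("file", null, "/C:/test/sampleApp/spark-warehouse", null, null)
  println(ok)
}
```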

Moreover, the same path works perfectly fine with a plain RDD:

val path = "hdfs://localhost:54310/tmp/home/mycsv.csv" 
val sc = SparkContext.getOrCreate() 
val rdd = sc.textFile(path) 
println(rdd.first()) //prints first row of CSV file 

I found and tried this as well, but with no luck :(

Am I missing something? Why is Spark looking at my local file system and not at HDFS?

I am using Spark 2.0 on Hadoop HDFS 2.7.2, with Scala 2.11.

Edit: One additional piece of information: I tried downgrading to Spark 1.6.2 and was able to make it work, so I think this is a bug in Spark 2.0.
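A commonly reported workaround for this Spark 2.0-on-Windows symptom (not taken from this post) is to set `spark.sql.warehouse.dir` to a well-formed `file:///` URI when building the session, so the catalog never has to qualify the bare drive-letter path. A configuration sketch only; the master, app name, and warehouse location are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Assumed values for illustration; substitute your own.
val spark = SparkSession.builder
  .master("local[*]")
  .appName("csv-from-hdfs")
  // A well-formed file URI (note the "/" before the drive letter) avoids
  // the "Relative path in absolute URI" failure in SessionCatalog.
  .config("spark.sql.warehouse.dir", "file:///C:/test/sampleApp/spark-warehouse")
  .getOrCreate()
```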

Can you try with '/tmp/home/mycsv.csv'? –

@AlbertoBonsanto, that throws an 'org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/home/mycsv.csv;' exception – Aiden

'hdfs://tmp/home/mycsv.csv'? –

Answer