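Is it possible to remove files from a Spark Streaming folder?

Spark 2.1. An ETL process converts files from the source file system to parquet and puts the resulting small parquet files into folder1. Streaming from folder1 works fine, but the parquet files in folder1 are too small for HDFS. We need to merge the small parquet files into larger ones, but when I try to remove files from folder1, the streaming query fails with an exception:

17/07/26 17:16:23 ERROR StreamExecution: Query [id = f29783ea-bdfb-4b59-a6f6-b77f79509a5a, runId = cbcce2b2-7d7b-4e31-a15a-7efed420f974] terminated with error java.io.FileNotFoundException: File does not exist

Is it possible to merge parquet files in a folder that Spark is streaming from?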
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8
      /_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.types._
val userSchema = new StructType()
  .add("itemId", "string")
  .add("tstamp", "integer")
  .add("rowtype", "string")
  .add("rowordernumber", "integer")
  .add("parentrowordernumber", "integer")
  .add("fieldname", "string")
  .add("valuestr", "string")
val csvDF = spark.readStream.schema(userSchema).parquet("/folder1/folder2")
csvDF.createOrReplaceTempView("tab1")
val aggDF = spark.sql("select distinct count(itemId) as cases_count from tab1")
aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
aggDF
  .writeStream
  .queryName("aggregates") // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()
spark.sql("select * from aggregates").show()
// Exiting paste mode, now interpreting.
+-----------+
|cases_count|
+-----------+
+-----------+
import org.apache.spark.sql.types._
userSchema: org.apache.spark.sql.types.StructType = StructType(StructField(itemId,StringType,true), StructField(tstamp,IntegerType,true), StructField(rowtype,StringType,true), StructField(rowordernumber,IntegerType,true), StructField(parentrowordernumber,IntegerType,true), StructField(fieldname,StringType,true), StructField(valuestr,StringType,true))
csvDF: org.apache.spark.sql.DataFrame = [itemId: string, tstamp: int ... 5 more fields]
aggDF: org.apache.spark.sql.DataFrame = [cases_count: bigint]
scala> -------------------------------------------
Batch: 0
-------------------------------------------
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
+-----------+
|cases_count|
+-----------+
| 292086106|
+-----------+
-------------------------------------------
Batch: 1
-------------------------------------------
+-----------+
|cases_count|
+-----------+
| 292086106|
+-----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----------+
|cases_count|
+-----------+
| 292086106|
+-----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+-----------+
|cases_count|
+-----------+
| 292086106|
+-----------+
-------------------------------------------
Batch: 4
-------------------------------------------
+-----------+
|cases_count|
+-----------+
| 324016758|
| 292086106|
+-----------+
-------------------------------------------
Batch: 5
-------------------------------------------
+-----------+
|cases_count|
+-----------+
| 355839229|
| 324016758|
| 292086106|
+-----------+
17/07/26 17:16:23 ERROR StreamExecution: Query [id = f29783ea-bdfb-4b59-a6f6-b77f79509a5a, runId = cbcce2b2-7d7b-4e31-a15a-7efed420f974] terminated with error
java.io.FileNotFoundException: File does not exist: /folder1/folder2/P-FMVDBAF-4021-20161107152556-1_006.gz.parquet
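For context, the merge step described in the question could be sketched roughly as below. This is only an illustration under stated assumptions, not the job actually used: the object name `Compaction`, the helper `targetFileCount`, the destination path, and the target size are all hypothetical. It writes a compacted copy to a separate folder and deliberately does not delete anything from the source folder, since removing files there is what triggered the FileNotFoundException above.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{SaveMode, SparkSession}

object Compaction {
  // Hypothetical helper: number of output files so that each one is
  // roughly targetBytes in size (ceiling division, never fewer than 1).
  def targetFileCount(totalBytes: Long, targetBytes: Long): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
  }

  // Reads the small parquet files under `src` and rewrites them as fewer,
  // larger files under `dst`. Nothing under `src` is deleted or modified.
  // Assumption: `dst` lies outside the folder the streaming query reads.
  def compact(spark: SparkSession, src: String, dst: String, targetBytes: Long): Unit = {
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Total on-disk size of everything under src, in bytes.
    val totalBytes = fs.getContentSummary(new Path(src)).getLength
    spark.read
      .parquet(src)
      .coalesce(targetFileCount(totalBytes, targetBytes))
      .write
      .mode(SaveMode.Overwrite)
      .parquet(dst)
  }
}
```

From spark-shell this could be called as `Compaction.compact(spark, "/folder1/folder2", "/folder1/folder2_compacted", 128L * 1024 * 1024)`, where the second path and the 128 MB target are placeholders. The file count is only an estimate, because the size of the rewritten files may differ from the size of the inputs.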
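Is this Spark Streaming or Structured Streaming? Care to share some code? It looks like Structured Streaming. Could you also include the whole stack trace? –
I have updated the main post with sample code. Yes, it is Structured Streaming, and I run the code in spark-shell. – Triffids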