I'm a beginner and need some help debugging very slow Spark performance. I'm running a transformation, and it has already been going for more than 2 hours.
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@...
scala> val t1_df = hiveContext.sql("select * from T1")
scala> t1_df.registerTempTable("T1")
warning: there was one deprecation warning; re-run with -deprecation for details
scala> t1_df.count
17/06/07 07:26:51 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res3: Long = 1732831
scala> val t1_df1 = t1_df.dropDuplicates(Array("c1","c2","c3", "c4"))
scala> t1_df1.registerTempTable("ABC")
warning: there was one deprecation warning; re-run with -deprecation for details
scala> hiveContext.sql("select * from T1 where c1 not in (select c1 from ABC)").count
[Stage 4:====================================================> (89 + 8)/97]
I'm using Spark 2.1.0, reading data from Hive 2.1.1 on an Amazon VM cluster of 7 nodes, each with 250GB RAM and 64 virtual cores. With such large resources I expected this simple query over 1.7 million records to fly, but it is painfully slow. Any pointers would be a big help.
Update: added the explain plan:
scala> hiveContext.sql("select * from T1 where c1 not in (select c1 from ABC)").explain
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, LeftAnti, (isnull((c1#26 = c1#26#1398)) || (c1#26 = c1#26#1398))
:- FileScan parquet default.t1_pq[cols ... more fields] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<hdr_msg_src:string,hdr_recv_tsmp:timestamp,hdr_desk_id:string,execprc:string,dreg:string,c...
+- BroadcastExchange IdentityBroadcastMode
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- Exchange hashpartitioning(c1#26, c2#59, c3#60L, c4#82, 200)
+- *HashAggregate(keys=[c1#26, c2#59, c3#60L, c4#82], functions=[])
+- *FileScan parquet default.atn_load_pq[c1#26,c2#59,c3#60L,c4#82] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://<hostname>/user/hive/warehouse/atn_load_pq], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:string,c2:string,c3:bigint,c4:string>
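The plan above shows why this is slow: SQL's `NOT IN` is null-aware (if the subquery result contains even one NULL, `NOT IN` matches nothing), so Spark cannot use a plain hash anti join and falls back to `BroadcastNestedLoopJoin`, comparing every row of T1 against the whole subquery result. A minimal plain-Scala sketch of that three-valued logic, using made-up data (`sub`, `notIn` are illustrations, not Spark API):

```scala
// Subquery rows modeled as Options; one NULL row is enough to change everything.
val sub: Seq[Option[String]] = Seq(Some("a"), Some("b"), None)

// SQL semantics of `x NOT IN (sub)`: true only if x matches no row
// AND the subquery result contains no NULL (otherwise the result is UNKNOWN).
def notIn(x: String, sub: Seq[Option[String]]): Boolean =
  !sub.contains(Some(x)) && !sub.contains(None)

println(notIn("c", sub))                         // false: the NULL row poisons the predicate
println(notIn("c", Seq(Some("a"), Some("b"))))   // true: no NULLs, "c" not present
```

If `c1` is known to be non-null on both sides, the same rows can be produced with an anti join that the planner can execute hash-based, e.g. `t1_df.join(t1_df1, Seq("c1"), "leftanti").count` instead of the `NOT IN` subquery.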
How many resources have you allocated (`spark.executor.instances` and `spark.executor.cores`)? Make sure you have set these to reasonable numbers. Also, check in the Spark UI whether your jobs/stages have enough parallelism. –
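For reference, the settings this comment mentions can be supplied on a `SparkConf` before the context is created (or as `--conf` flags to `spark-shell`). The numbers below are placeholders for a 7-node cluster, not tuned recommendations:

```scala
import org.apache.spark.SparkConf

// Hypothetical values; tune to the actual workload and node sizes.
val conf = new SparkConf()
  .set("spark.executor.instances", "21") // e.g. 3 executors per node
  .set("spark.executor.cores", "5")      // cores per executor
  .set("spark.executor.memory", "32g")   // heap per executor
```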
Can you do one thing for us? Add the command that shows the physical plan: `hiveContext.sql("select * from T1 where c1 not in (select c1 from ABC)").explain()` –
Thiago, I've added the explain plan. – birjoossh