1
我有下面的代碼與此數問題:我該如何解釋Apache Spark RDD Lineage Graph?
val input1 = rawinput.map(_.split("\t")).map(x=>(x(6).trim(),x)).sortByKey()
val input2 = input1.map(x=> x._2.mkString("\t"))
val x0 = input2.map(_.split("\t")).map(x => (x(6),x(0))
val x1 = input2.map(_.split("\t")).map(x => (x(6),x(1))
val x2 = input2.map(_.split("\t")).map(x => (x(6),x(2))
val x3 = input2.map(_.split("\t")).map(x => (x(6),x(3))
val x4 = input2.map(_.split("\t")).map(x => (x(6),x(4))
val x5 = input2.map(_.split("\t")).map(x => (x(6),x(5))
val x6 = input2.map(_.split("\t")).map(x => (x(6),x(6))
val x = x0 union x1 union x2 union x3 union x4 union x5 union x6
<pre>
**Lineage Graph:**
(7) UnionRDD[25] at union at rddCustUtil.scala:78 []
| UnionRDD[24] at union at rddCustUtil.scala:78 []
| UnionRDD[23] at union at rddCustUtil.scala:78 []
| UnionRDD[22] at union at rddCustUtil.scala:78 []
| UnionRDD[21] at union at rddCustUtil.scala:78 []
| UnionRDD[20] at union at rddCustUtil.scala:78 []
| MapPartitionsRDD[7] at map at rddCustUtil.scala:43 []
| MapPartitionsRDD[6] at map at rddCustUtil.scala:43 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[9] at map at rddCustUtil.scala:48 []
| MapPartitionsRDD[8] at map at rddCustUtil.scala:48 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[11] at map at rddCustUtil.scala:53 []
| MapPartitionsRDD[10] at map at rddCustUtil.scala:53 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[13] at map at rddCustUtil.scala:58 []
| MapPartitionsRDD[12] at map at rddCustUtil.scala:58 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[15] at map at rddCustUtil.scala:63 []
| MapPartitionsRDD[14] at map at rddCustUtil.scala:63 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[17] at map at rddCustUtil.scala:68 []
| MapPartitionsRDD[16] at map at rddCustUtil.scala:68 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
| MapPartitionsRDD[19] at map at rddCustUtil.scala:73 []
| MapPartitionsRDD[18] at map at rddCustUtil.scala:73 []
| MapPartitionsRDD[5] at map at rddCustUtil.scala:40 []
| ShuffledRDD[4] at sortByKey at rddCustUtil.scala:38 []
+-(1) MapPartitionsRDD[3] at map at rddCustUtil.scala:38 []
| MapPartitionsRDD[2] at map at rddCustUtil.scala:38 []
| /Data/ MapPartitionsRDD[1] at textFile at rddCustUtil.scala:35 []
| /Data/ HadoopRDD[0] at textFile at rddCustUtil.scala:35 []
</pre>
- 能否請您解釋一下我有多少洗牌階段將被執行,因爲它正顯示出7 ShuffledRDD [4]?
- 你能否給我詳細的解釋下面的DAG流?
- 此操作是否昂貴?
非常感謝Tzach,但如果我使用緩存或持久方法,那麼有多少shuffle需要對數據進行排序? – Souvik
一,我相信。打印輸出可能不會反映緩存階段將被_skipped_感謝的事實。 –
嗨Tzach,我有一個問題在下面的staement: – Souvik