Best way to get custom JARs onto the Spark worker classpath

I'm working on an ETL pipeline in Spark, and I've found that pushing out releases is bandwidth-intensive. My release script (pseudo):
# build the uberjar, then upload it to the "spark" object-storage container
sbt assembly
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar
spark-submit \
--class comapplications.WindowsETLElastic \
--master spark://spark-submit.cloud \
--deploy-mode cluster \
--verbose \
--conf "spark.executor.memory=16g" \
"$JAR_URL"
It works, but it can take four minutes to assemble and another minute to push. My build.sbt:
name := "secmon_etl"
version := "1.2"
scalaVersion := "2.11.8"
exportJars := true
assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0",
  "io.spray" %% "spray-json" % "1.3.3",
  // "commons-net" % "commons-net" % "3.5",
  // "org.apache.httpcomponents" % "httpclient" % "4.5.2",
  "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1"
)

assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { (old) =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
  }
}
The problem seems to be the sheer size of elasticsearch-spark-20_2.11: it adds roughly 90 MB to my uberjar. I'd be happy to turn it into a provided dependency that lives on the Spark hosts so it doesn't have to be packaged at all. The question is, what's the best way to do that? Should I copy the jars over by hand? Or is there a foolproof way to specify the dependency and have a tool resolve all of its transitive dependencies?
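To make that concrete, here is the change I have in mind, as a sketch rather than something I've verified end to end. In build.sbt, the connector's dependency line would become provided so sbt-assembly leaves it out of the uberjar:

"org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" % "provided"

Then, instead of copying anything by hand, spark-submit's --packages flag can resolve the artifact and all of its transitive dependencies from Maven Central (via Ivy) and ship them to the driver and executors, assuming the cluster has outbound access to a Maven repository:

spark-submit \
--class comapplications.WindowsETLElastic \
--master spark://spark-submit.cloud \
--deploy-mode cluster \
--verbose \
--conf "spark.executor.memory=16g" \
--packages org.elasticsearch:elasticsearch-spark-20_2.11:5.3.1 \
"$JAR_URL"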
I ran into a snag where the master couldn't restart. The uberjar can't be on the master's classpath, or it automatically runs some code and breaks the ZooKeeper connection code. – xrl
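If I do end up copying jars by hand, one way around that failure mode might be to keep the connector off the master entirely and put it only on the worker-side classpaths. A rough sketch, assuming the jar has already been copied to the same path on every worker; the /opt/spark/extra-jars path is made up for illustration:

# copy the connector jar to each worker (destination path is hypothetical)
scp elasticsearch-spark-20_2.11-5.3.1.jar worker-N:/opt/spark/extra-jars/

# executors (and, in cluster mode, the driver, which also runs on a worker)
# pick the jar up from a local path; the master never sees it
spark-submit \
--class comapplications.WindowsETLElastic \
--master spark://spark-submit.cloud \
--deploy-mode cluster \
--conf "spark.executor.extraClassPath=/opt/spark/extra-jars/elasticsearch-spark-20_2.11-5.3.1.jar" \
--conf "spark.driver.extraClassPath=/opt/spark/extra-jars/elasticsearch-spark-20_2.11-5.3.1.jar" \
"$JAR_URL"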