
Best way to get custom JARs onto the Spark workers' classpath

I'm working on an ETL pipeline in Spark, and I've found that pushing a release is bandwidth-intensive. My release script (pseudo):

sbt assembly 
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar 
spark-submit \ 
    --class comapplications.WindowsETLElastic \ 
    --master spark://spark-submit.cloud \ 
    --deploy-mode cluster \ 
    --verbose \ 
    --conf "spark.executor.memory=16g" \ 
    "$JAR_URL" 

It works, but it can take four minutes to assemble and another minute to push. My build.sbt:

name := "secmon_etl" 

version := "1.2" 

scalaVersion := "2.11.8" 

exportJars := true 

assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar" 

libraryDependencies ++= Seq (
    "org.apache.spark" %% "spark-core" % "2.1.0" % "provided", 
    "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided", 
    "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0", 
    "io.spray" %% "spray-json" % "1.3.3", 
// "commons-net" % "commons-net" % "3.5", 
// "org.apache.httpcomponents" % "httpclient" % "4.5.2", 
    "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" 
) 

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
}

The problem seems to be the sheer size of elasticsearch-spark-20_2.11: it adds roughly 90 MB to my uberjar. I'd love to turn it into a provided dependency that lives on the Spark hosts, so it doesn't have to be packaged at all. The question is, what's the best way to do that? Should I copy the jars over by hand, or is there a foolproof way to specify the dependency and have a tool resolve all of its transitive dependencies?
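
One existing tool that can resolve a dependency together with all of its transitive dependencies is coursier; the one-liner below is a minimal sketch (not part of the original question), using the elasticsearch-spark coordinates named above.

# Fetch elasticsearch-spark and its transitive dependencies and print their
# local paths; these jars could then be copied onto the workers by hand.
coursier fetch org.elasticsearch:elasticsearch-spark-20_2.11:5.3.1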

Answer


I have my Spark job running, and deploys are much faster now. I ran

sbt assemblyPackageDependency 

which produced a huge jar (110 MB!) that drops straight into the "jars" folder of the Spark distribution, so my Dockerfile for the Spark cluster now looks like this:

FROM openjdk:8-jre 

ENV SPARK_VERSION 2.1.0 
ENV HADOOP_VERSION hadoop2.7 
ENV SPARK_MASTER_OPTS="-Djava.net.preferIPv4Stack=true" 

RUN apt-get update && apt-get install -y python 

RUN curl -sSLO http://mirrors.ocf.berkeley.edu/apache/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz && tar xzfC /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz /usr/share && rm /spark-$SPARK_VERSION-bin-$HADOOP_VERSION.tgz 

# master's or worker's web UI port
EXPOSE 8080
# master port (spark:// submissions and worker registration)
EXPOSE 7077

ADD deps.jar /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION/jars/ 

WORKDIR /usr/share/spark-$SPARK_VERSION-bin-$HADOOP_VERSION 
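
The build loop around this Dockerfile looks roughly like the sketch below. The exact name of the jar written by assemblyPackageDependency depends on sbt-assembly's defaults, so the glob and the image tag are illustrative assumptions rather than commands from the original post.

sbt assemblyPackageDependency                # build the dependencies-only jar
cp target/scala-2.11/*-deps.jar deps.jar     # rename it to what the Dockerfile ADDs (glob is illustrative)
docker build -t spark-cluster .              # bake the dependencies into the cluster image (tag is illustrative)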

After deploying that configuration, I changed my build.sbt so the kafka-streaming and elasticsearch-spark jars and their dependencies are marked as provided:

name := "secmon_etl" 

version := "1.2" 

scalaVersion := "2.11.8" 

exportJars := true 

assemblyJarName in assembly := s"${name.value}-${version.value}-super.jar" 

libraryDependencies ++= Seq (
    "org.apache.spark" %% "spark-core" % "2.1.0" % "provided", 
    "org.apache.spark" %% "spark-streaming" % "2.1.0" % "provided", 

    "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0" % "provided", 
    "io.spray" %% "spray-json" % "1.3.3" % "provided", 
    "org.elasticsearch" % "elasticsearch-spark-20_2.11" % "5.3.1" % "provided" 
) 

assemblyMergeStrategy in assembly := {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => MergeStrategy.first
}

Now my deploys complete in under 20 seconds!
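
The release loop itself is unchanged from the question; only the assembly is now far smaller, because the heavy dependencies are provided by the image. A sketch of the resulting flow, reusing the commands from the original pseudo script:

sbt assembly                  # uberjar now contains only the application code
openstack object create spark target/scala-2.11/etl-$VERSION-super.jar
spark-submit \
    --class comapplications.WindowsETLElastic \
    --master spark://spark-submit.cloud \
    --deploy-mode cluster \
    --verbose \
    --conf "spark.executor.memory=16g" \
    "$JAR_URL"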


I did run into one gotcha where the master couldn't restart. The uberjar must not be on the master's classpath, otherwise it automatically runs some code and breaks the ZooKeeper connection logic. – xrl
