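I am running spark-shell with the MongoDB connector, but the program is so slow that I doubt I will ever get a response from it. Spark with MongoDB is very slow.
My spark-shell command is: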
./spark-shell --master spark://spark_host:7077 \
--conf "spark.mongodb.input.uri=mongodb://mongod_user:[email protected]_host:27017/database.collection?readPreference=primaryPreferred" \
--jars /mongodb/lib/mongo-spark-connector_2.10-2.0.0.jar,/mongodb/lib/bson-3.2.2.jar,/mongodb/lib/mongo-java-driver-3.2.2.jar
And my application code is:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.mongodb.spark._
import org.bson.Document
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.rdd.MongoRDD
val sparkSession = SparkSession.builder().getOrCreate()
val df = MongoSpark.load(sparkSession)
val dataset = df.filter("thisRequestTime > 1499250131596")
dataset.first // waits a very long time
Am I missing something? Please help. PS: My Spark cluster runs in standalone mode. The application dependencies are:
<properties>
  <encoding>UTF-8</encoding>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
  <scala.compat.version>2.11</scala.compat.version>
  <spark.version>2.1.1</spark.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.compat.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.mongodb.spark</groupId>
    <artifactId>mongo-spark-connector_${scala.compat.version}</artifactId>
    <version>2.0.0</version>
  </dependency>
</dependencies>
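How large is the dataset you expect? How long does the query take to run on MongoDB? –
Thanks for your reply, @Rick Moritz. The collection in MongoDB holds 194920414 documents in total, and 749216 of them satisfy the filter condition. The Spark application returns a response after half an hour, but in the mongodb shell I get a response for the same condition within milliseconds. – Milk
PS: I have an index on the field used in the condition in the MongoDB documents. – Milk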
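Given the index and the small share of matching documents mentioned above, it may be worth checking whether the filter is actually evaluated by MongoDB or only after the documents have been shipped to Spark. The following is a minimal sketch, not a confirmed fix. It assumes a Spark 2.x spark-shell, where `sc` and `spark` are predefined, and the same `spark.mongodb.input.uri` setting as in the command above. The names `matchStage` and `filteredRdd` are made up for illustration.

```scala
// Sketch for spark-shell (assumes `sc` and `spark` are predefined and
// spark.mongodb.input.uri is configured as in the question).
import com.mongodb.spark._
import org.bson.Document

// Hypothetical name: an aggregation $match stage carrying the same condition
// as the DataFrame filter, so MongoDB applies it before sending data to Spark.
val matchStage = Document.parse(
  "{ $match: { thisRequestTime: { $gt: 1499250131596 } } }")

// withPipeline attaches the stage to the MongoRDD's read.
val filteredRdd = MongoSpark.load(sc).withPipeline(Seq(matchStage))
println(filteredRdd.first())

// For the DataFrame version from the question, print the query plan and
// inspect the scan node to see whether the predicate was handed to the source.
val df = MongoSpark.load(spark)
df.filter("thisRequestTime > 1499250131596").explain()
```

If the `$match` version returns quickly while the DataFrame version does not, that would suggest looking at how the filter is pushed down or how the collection is partitioned for reading, rather than at MongoDB itself.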