2016-03-08

Why do I get the error "Size exceeds Integer.MAX_VALUE" when using Spark + Cassandra?

I have 7 Cassandra nodes (5 nodes with 32 cores and 32G memory, and 4 nodes with 4 cores and 64G memory), with Spark workers deployed on this cluster and the Spark master on an 8th node. I use the spark-cassandra-connector for them. My Cassandra now holds nearly a billion records with 30 fields, and the Scala code I wrote includes the following snippet:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.cassandra.CassandraSQLContext

def startOneCache(): DataFrame = {
  val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "192.168.0.184")
    .set("spark.cassandra.auth.username", "username")
    .set("spark.cassandra.auth.password", "password")
    .set("spark.driver.maxResultSize", "4G")
    .set("spark.executor.memory", "12G")
    .set("spark.cassandra.input.split.size_in_mb", "64")

  val sc = new SparkContext("spark://192.168.0.131:7077", "statistics", conf)
  val cc = new CassandraSQLContext(sc)
  val rdd: DataFrame = cc.sql(
    "select user_id,col1,col2,col3,col4,col5,col6,col7,col8 from user_center.users")
    .limit(100000192)
  val rdd_cache: DataFrame = rdd.cache()

  rdd_cache.count()
  rdd_cache
}

I ran the above code on the Spark master. While the statement rdd_cache.count() was executing, I got an ERROR on one worker node, 192.168.0.185:

16/03/08 15:38:57 INFO ShuffleBlockFetcherIterator: Started 4 remote fetches in 221 ms 
16/03/08 15:43:49 WARN MemoryStore: Not enough space to cache rdd_6_0 in memory! (computed 4.6 GB so far) 
16/03/08 15:43:49 INFO MemoryStore: Memory use = 61.9 KB (blocks) + 4.6 GB (scratch space shared across 1 tasks(s)) = 4.6 GB. Storage limit = 6.2 GB. 
16/03/08 15:43:49 WARN CacheManager: Persisting partition rdd_6_0 to disk instead. 
16/03/08 16:13:11 ERROR Executor: Managed memory leak detected; size = 4194304 bytes, TID = 24002 
16/03/08 16:13:11 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 24002) 
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE 

I suspect the final error Size exceeds Integer.MAX_VALUE was caused by the earlier warning 16/03/08 15:43:49 WARN MemoryStore: Not enough space to cache rdd_6_0 in memory! (computed 4.6 GB so far), but I don't know why. Should I set something larger than .set("spark.executor.memory", "12G")? What should I do to fix this?

Answers


No Spark shuffle block can be greater than 2 GB.

Spark uses ByteBuffer as the abstraction for storing blocks, and a ByteBuffer's size is limited to Integer.MAX_VALUE (2147483647 bytes, just under 2 GB).
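A quick back-of-the-envelope check (a plain-Scala sketch, reusing the 4.6 GB figure from the MemoryStore warning in the question) shows why a partition that large cannot fit into a single ByteBuffer:

```scala
// Standalone sketch: compare the partition size reported in the log
// against the maximum ByteBuffer capacity (Int.MaxValue bytes).
object BlockLimitCheck {
  def main(args: Array[String]): Unit = {
    val maxBlockBytes: Long = Int.MaxValue // 2147483647 bytes, just under 2 GB
    val partitionBytes: Long = (4.6 * 1024 * 1024 * 1024).toLong // ~4.6 GB from the log
    // Any single block larger than Int.MaxValue triggers "Size exceeds Integer.MAX_VALUE"
    println(partitionBytes > maxBlockBytes) // prints: true
  }
}
```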

A low number of partitions can lead to large shuffle block sizes. To fix this, try increasing the number of partitions using rdd.repartition() or rdd.coalesce().
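Applied to the DataFrame from the question, that could look like the sketch below (not tested against the asker's cluster; the value 2000 is an arbitrary illustration, chosen only so each of the ~1 billion rows' partitions stays well under 2 GB):

```scala
// Sketch: repartition before caching so each stored/shuffled block
// stays well below the 2 GB ByteBuffer limit. `cc` is the
// CassandraSQLContext from the question; 2000 is illustrative.
val rdd: DataFrame = cc.sql(
  "select user_id,col1,col2,col3,col4,col5,col6,col7,col8 from user_center.users")
  .limit(100000192)
  .repartition(2000) // more partitions => smaller blocks per partition
val rdd_cache: DataFrame = rdd.cache()
rdd_cache.count()
```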

If that doesn't work, it means at least one partition is still too big, and you may need a more sophisticated approach to shrink it, for example using randomness to equalize the distribution of RDD data across the individual partitions.
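"Using randomness" can be sketched as salting. The snippet below assumes Spark 1.6+ (for repartitioning by column) and reuses the `rdd` DataFrame from the question; the column name `salt` and the count 2000 are made up for illustration:

```scala
// Sketch (Spark 1.6+): add a random "salt" column and repartition on it,
// so rows from a skewed key spread evenly across partitions.
// "salt" and 2000 are illustrative, not from the original post.
import org.apache.spark.sql.functions.{col, rand}

val salted = rdd
  .withColumn("salt", (rand() * 2000).cast("int")) // uniform random bucket id
  .repartition(2000, col("salt"))                  // hash-partition on the salt
  .drop("salt")                                    // remove the helper column
```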


Even though this is a correct answer, some explanation would be useful. – zero323


'拉多Buransky',謝謝!我應該怎麼做才能得到當前rdd中有多少個分區?在我的Spark UI中,總任務是'23660',這是當前的分區數量,如果是的話,我應該設置多少個分區來解決這個錯誤? – abelard2008


@abelard2008 Try this: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html –
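In code, the partition count can also be read directly from the DataFrame's underlying RDD (a sketch; `rdd_cache` is the cached DataFrame from the question):

```scala
// Sketch: inspect how many partitions back the cached DataFrame.
val numPartitions = rdd_cache.rdd.partitions.length
println(s"partitions: $numPartitions")
```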
