
Apache Mahout SimilarityAnalysis throwing NegativeArraySizeException for CCO

When calling Apache Mahout's SimilarityAnalysis for CCO, I get a fatal NegativeArraySizeException.

The code I am running looks like this:

val result = SimilarityAnalysis.cooccurrencesIDSs(myIndexedDataSet:Array[IndexedDataset], 
     randomSeed = 1234, 
     maxInterestingItemsPerThing = 3, 
     maxNumInteractions = 4) 

I see the following error and corresponding stack trace:

17/04/19 20:49:09 ERROR Executor: Exception in task 0.0 in stage 11.0 (TID 20) 
java.lang.NegativeArraySizeException 
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57) 
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73) 
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
17/04/19 20:49:09 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 21) 
java.lang.NegativeArraySizeException 
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57) 
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73) 
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
17/04/19 20:49:09 WARN TaskSetManager: Lost task 0.0 in stage 11.0 (TID 20, localhost): java.lang.NegativeArraySizeException 
    at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:57) 
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:73) 
    at org.apache.mahout.sparkbindings.SparkEngine$$anonfun$5.apply(SparkEngine.scala:72) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) 
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

I am using Apache Mahout version 0.13.0.

Answers


This always means one of the input matrices is empty. How many matrices are in the array, and how many rows and columns does each have? IndexedDatasetSpark has a companion object that supplies a constructor, called apply in Scala, which takes an RDD[(String, String)], so if you can get your data into an RDD you can just construct the IndexedDatasetSpark from it. The pair of strings here is user-id, item-id for some event such as a purchase.

See the companion object here: https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala#L75
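As a quick check on the inputs, something like the following sketch (it assumes myIndexedDataSet is the Array[IndexedDataset] being passed to cooccurrencesIDSs) prints each input matrix's dimensions so an empty one stands out:

myIndexedDataSet.zipWithIndex.foreach { case (ids, i) =>
  // matrix is the Mahout DRM wrapped by each IndexedDataset
  println(s"matrix $i: ${ids.matrix.nrow} rows x ${ids.matrix.ncol} columns")
}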

A little searching will turn up code that reads a CSV into an RDD[(String, String)] in a line or so. It will look something like this:

val rawPurchaseInteractions = sc.textFile("/path/in/hdfs").map { line =>
  // each line is expected to look like "user-id,item-id"
  (line.split(",")(0), line.split(",")(1))
}

Although this splits the line twice, it expects a text file of comma-delimited user-id,item-id rows for some interaction type like "purchase". If the file has other fields, just split out the user-id and item-id. The body of the map function returns a pair of Strings, so the resulting RDD is of the right type, namely RDD[(String, String)]. Pass this to IndexedDatasetSpark:

val purchasesRdd = IndexedDatasetSpark(rawPurchaseInteractions)(sc) 

where sc is your Spark context. This should give you a non-empty IndexedDatasetSpark, which you can check by looking at the size of the wrapped BiDictionary or by calling methods on the wrapped Mahout DRM.
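For instance, a rough sanity check along these lines (field names taken from the IndexedDatasetSpark source linked above; treat the .size call on the BiDictionary as an assumption, this is a sketch rather than the exact API):

// rowIDs/columnIDs are the wrapped BiDictionaries, matrix is the wrapped Mahout DRM
println(s"users:  ${purchasesRdd.rowIDs.size}")
println(s"items:  ${purchasesRdd.columnIDs.size}")
println(s"matrix: ${purchasesRdd.matrix.nrow} x ${purchasesRdd.matrix.ncol}")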

By the way, this assumes a CSV with no header; it is really comma-delimited text rather than full-spec CSV. Spark has other ways to read a true CSV, but that is probably not needed here.
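If the file does have a header row, one way to drop it before splitting is sketched below (the header content shown is hypothetical):

val raw = sc.textFile("/path/in/hdfs")
val header = raw.first()             // e.g. "user_id,item_id"
val rawPurchaseInteractions = raw
  .filter(_ != header)               // drop the header line
  .map { line =>
    val fields = line.split(",")
    (fields(0), fields(1))
  }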


Thanks @pferrel for the reply. I figured out the problem, which turned out to be unrelated to Mahout (see below). – ldeluca


The problem actually turned out to have nothing to do with Mahout, but with an earlier line:

inputRDD.filter(_ (1) == primaryFilter).map(o => (o(0), o(2))) 

where the indexing was off: I had used fields 1 through 3 instead of 0 through 2. Given where the error was thrown I was sure it came from inside Mahout, but this turned out to be the real problem.
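To make the off-by-one concrete (the exact field layout of inputRDD is assumed here purely for illustration), split produces a 0-indexed array, so for a user-id,interaction,item-id line the valid indices are 0 through 2:

// hypothetical three-field line; the layout is assumed for illustration only
val fields = "u1,purchase,i42".split(",")   // Array("u1", "purchase", "i42")

// off by one: indices 1 to 3 skip the user id and run past the end of the array
// fields(3)                                // throws ArrayIndexOutOfBoundsException

// correct: indices 0 to 2 cover user-id, interaction type, item-id
val pair = (fields(0), fields(2))           // ("u1", "i42"), filtered on fields(1)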