
Converting from DataFrame to JavaPairRDD<Long, Vector>

I am trying to implement the LDA algorithm from Apache Spark using the Java API. The LDA().run() method takes the documents as a JavaPairRDD<Long, Vector> parameter. Using Scala, I create the RDD[(Long, Vector)] as follows:

val countVectors = cvModel.transform(filteredTokens) 
    .select("docId", "features") 
    .map { case Row(docId: Long, countVector: Vector) => (docId, countVector) } 
    .cache() 

Then I feed it to LDA:

lda.run(countVectors) 
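On the Java side, the call I am aiming for would be along these lines (a minimal sketch; numTopics is just an illustrative parameter):

import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.clustering.LDAModel;

// Sketch of the target Java call; numTopics is an illustrative value
LDA lda = new LDA().setK(numTopics);
LDAModel ldaModel = lda.run(countVectors);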

But in the Java API, I have a CountVectorizerModel created with the following code:

CountVectorizerModel cvModel = new CountVectorizer() 
     .setInputCol("filtered").setOutputCol("features") 
     .setVocabSize(vocabSize).fit(filteredTokens); 

The transformed data looks like this (sparse vectors in (size, [indices], [values]) form):

(0,(22,[0,8,9,10,14,16,18], 
[1.0,1.0,1.0,1.0,1.0,1.0,1.0])) 
(1,(22,[0,1,2,3,4,5,6,7,11,12,13,15,17,19,20,21], 
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])) 
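For reference, one way to dump rows like these for inspection (a sketch):

import org.apache.spark.sql.Row;

// Print a few transformed rows to inspect the sparse count vectors
for (Row r : cvModel.transform(filteredTokens)
        .select("docId", "features").collectAsList()) {
    System.out.println(r);
}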

What should I do if I want to convert the output of cvModel into the JavaPairRDD countVectors? I tried this:

JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens) 
    .select("docId", "features").toJavaRDD() 
    .mapToPair(new PairFunction<Row, Long, Vector>() { 
        public Tuple2<Long, Vector> call(Row row) throws Exception { 
            return new Tuple2<Long, Vector>(Long.parseLong(row.getString(0)), 
                Vectors.dense(row.getDouble(1))); 
        } 
    }).cache(); 

But it does not work. I get an exception when it reaches:

Vectors.dense(row.getDouble(1)) 

So if you know of any suitable way to convert the cvModel DataFrame into a JavaPairRDD, please tell me.

I am using Spark with MLlib 1.5.1 and Java 8.

Any help is highly appreciated. Thanks. Here is the exception log from when I try to convert from the DataFrame to the JavaPairRDD:

15/10/25 10:03:07 ERROR Executor: Exception in task 0.0 in stage 7.0  (TID 6) 
java.lang.ClassCastException: java.lang.Long cannot be cast to  java.lang.String 
at org.apache.spark.sql.Row$class.getString(Row.scala:249) 
at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:191) 
at UIT_LDA_ONLINE.LDAOnline$2.call(LDAOnline.java:88) 
at UIT_LDA_ONLINE.LDAOnline$2.call(LDAOnline.java:1) 
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1030) 
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1030) 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278) 
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) 
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262) 
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) 
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
at org.apache.spark.scheduler.Task.run(Task.scala:88) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
15/10/25 10:03:07 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 6, localhost): java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String 

Protip: _it doesn't work_ is not a very good problem description :) – zero323


Thank you, I will add details about the exception :) –


OK, so we are getting somewhere, but we are not there yet. Exceptions usually provide quite a lot of information: the type of the exception, which part exactly caused it, and so on. All of that should guide the question, not fill the screen. Also, you are missing a closing bracket in your edit. – zero323

Answer


Now that we have the error stack, here is the problem:

You are trying to get a String out of the row, but your field is a Long, so for starters you need to replace row.getString(0) with row.getLong(0).
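In other words, for that line (a sketch of just the fix):

// Before: docId is a Long in the schema, so getString throws
// java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
Long docId = Long.parseLong(row.getString(0));
// After: read the value with the type-correct getter
Long docId = row.getLong(0);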

Once you fix that, you will hit other errors of the same kind in other places. I could point them out with the information given, but you will be able to solve them yourself if you apply the following:

Row getters are specific to each field's type, so you need to use the right get method for each column.
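Applied to the conversion from the question, a corrected version could look like the sketch below. It assumes docId is a long and features is an mllib Vector, as the schema suggests; casting row.get(1) is one way to read the vector column directly instead of rebuilding it with Vectors.dense:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.sql.Row;
import scala.Tuple2;

JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
    .select("docId", "features").toJavaRDD()
    .mapToPair(new PairFunction<Row, Long, Vector>() {
        public Tuple2<Long, Vector> call(Row row) throws Exception {
            // docId is a long, so use getLong rather than getString;
            // features is already a Vector, so read it as-is
            return new Tuple2<Long, Vector>(row.getLong(0), (Vector) row.get(1));
        }
    }).cache();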

If you are unsure which method to use, call printSchema on your DataFrame to check the type of each field; you can then find all the type mappings described in the official documentation here.
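For example (a sketch; the exact nullable flags may differ):

// Check the column types so you can pick the matching Row getters
cvModel.transform(filteredTokens).select("docId", "features").printSchema();

// Expected output along these lines:
// root
//  |-- docId: long (nullable = false)
//  |-- features: vector (nullable = true)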