
I am using MLlib with Spark 1.5.1 and getting "Input type must be ArrayType(StringType) but got StringType." What is wrong with my code, and how do I transform a DataFrame of JSON objects with StopWordsRemover?

StopWordsRemover remover = new StopWordsRemover() 
         .setInputCol("text") 
         .setOutputCol("filtered"); 

DataFrame df = sqlContext.read().json("file:///home/ec2-user/spark_apps/article.json"); 

System.out.println("***DATAFRAME SCHEMA: " + df.schema()); 

DataFrame filteredTokens = remover.transform(df); 
filteredTokens.show(); 

OUTPUT:

***DATAFRAME SCHEMA: StructType(StructField(doc_id,LongType,true), StructField(image,StringType,true), StructField(link_title,StringType,true), StructField(sentiment_polarity,DoubleType,true), StructField(sentiment_subjectivity,DoubleType,true), StructField(text,StringType,true), StructField(url,StringType,true)) 

ERROR:

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Input type must be ArrayType(StringType) but got StringType. 
    at scala.Predef$.require(Predef.scala:233) 
    at org.apache.spark.ml.feature.StopWordsRemover.transformSchema(StopWordsRemover.scala:149) 
    at org.apache.spark.ml.feature.StopWordsRemover.transform(StopWordsRemover.scala:129) 
    at com.bah.ossem.spark.topic.LDACountVectorizer.main(LDACountVectorizer.java:50) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:497) 
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) 
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) 
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) 
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) 
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 

article.json (line 1):

{"doc_id": 11, "sentiment_polarity": 0.223, "link_title": "Donald Trump will live-tweet 's Democratic Debate - Politics.com", "sentiment_subjectivity": 0.594, "url": "https://www.cnn.com/...", "text": "Watch the first Democratic presidential debate Tuesday...", "image": "http://i2.cdn.turner.com..."} 

EDIT: Implemented zero323's Scala answer in Java and it works great. Thanks, zero323!

Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words"); 

StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered"); 

DataFrame jsondf = sqlContext.read().json("file:///home/ec2-user/spark_apps/article.json"); 

DataFrame wordsDataFrame = tokenizer.transform(jsondf); 

DataFrame filteredTokens = remover.transform(wordsDataFrame); 
filteredTokens.show(); 

CountVectorizerModel cvModel = new CountVectorizer() 
     .setInputCol("filtered").setOutputCol("features") 
     .setVocabSize(10).fit(filteredTokens); 
cvModel.transform(filteredTokens).show(); 

Answer


Well, the error message is self-explanatory: StopWordsRemover expects an array of strings (ArrayType(StringType)) as input, not a plain String, which means you have to tokenize your data first. Using the Scala API:

import org.apache.spark.ml.feature.Tokenizer 
import org.apache.spark.ml.feature.StopWordsRemover 
import org.apache.spark.sql.DataFrame 

val tokenizer: Tokenizer = new Tokenizer() 
    .setInputCol("text") 
    .setOutputCol("tokens_raw") 

val remover: StopWordsRemover = new StopWordsRemover() 
    .setInputCol("tokens_raw") 
    .setOutputCol("tokens") 

val tokenized: DataFrame = tokenizer.transform(df) 
val filtered: DataFrame = remover.transform(tokenized) 

Yes, thanks. I assumed the input would be a string of tokens. Duh, I get it now. It would be nice to see this in Java, since I fumbled a bit with how to tokenize one column while still keeping the other data in the DataFrame that doesn't need tokenizing. –


Never mind. I implemented the Java version just like the Scala code above and it works perfectly, and the DataFrame still retains everything. I had thought the only column left in the tokenized DataFrame would be the inputCol I specified. Thanks a lot. –
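
(Side note, not from the original thread: a quick way to confirm that the other columns survive is to inspect the transformed DataFrame, since spark.ml transformers append their outputCol rather than replacing existing columns. A minimal check, using the filteredTokens DataFrame from the edit above:)

// Illustrative check: the schema shows the original columns plus "words" and "filtered" 
filteredTokens.printSchema(); 
filteredTokens.select("doc_id", "text", "filtered").show(); 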


I saw your help on this question: http://stackoverflow.com/questions/33308586/convert-from-dataframe-to-javapairrddlong-vector Is that the right way to create the corpus for LDA? –
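
(For anyone landing here: a minimal sketch of that conversion, assuming the doc_id and features columns produced by the code in the edit above. The cast to DistributedLDAModel relies on the default "em" optimizer in Spark 1.5, and k = 10 is illustrative, not from the original post.)

import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.mllib.clustering.DistributedLDAModel; 
import org.apache.spark.mllib.clustering.LDA; 
import org.apache.spark.mllib.linalg.Vector; 
import scala.Tuple2; 

// Pair each document id with its term-count vector. In Spark 1.5 the 
// ml CountVectorizer emits org.apache.spark.mllib.linalg.Vector values. 
JavaPairRDD<Long, Vector> corpus = cvModel.transform(filteredTokens) 
    .select("doc_id", "features") 
    .toJavaRDD() 
    .mapToPair(row -> new Tuple2<>(row.getLong(0), (Vector) row.get(1))) 
    .cache(); // LDA makes several passes over the corpus 

// Fit LDA on the (docId, vector) pairs. 
DistributedLDAModel ldaModel = 
    (DistributedLDAModel) new LDA().setK(10).run(corpus); 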