I am using MLlib with Spark 1.5.1. I get "Input type must be ArrayType(StringType) but got StringType." What is wrong with my code? How do I use StopWordsRemover to transform a DataFrame of JSON objects?
StopWordsRemover remover = new StopWordsRemover()
.setInputCol("text")
.setOutputCol("filtered");
DataFrame df = sqlContext.read().json("file:///home/ec2-user/spark_apps/article.json");
System.out.println("***DATAFRAME SCHEMA: " + df.schema());
DataFrame filteredTokens = remover.transform(df);
filteredTokens.show();
OUTPUT:
***DATAFRAME SCHEMA: StructType(StructField(doc_id,LongType,true), StructField(image,StringType,true), StructField(link_title,StringType,true), StructField(sentiment_polarity,DoubleType,true), StructField(sentiment_subjectivity,DoubleType,true), StructField(text,StringType,true), StructField(url,StringType,true))
ERROR:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Input type must be ArrayType(StringType) but got StringType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.feature.StopWordsRemover.transformSchema(StopWordsRemover.scala:149)
at org.apache.spark.ml.feature.StopWordsRemover.transform(StopWordsRemover.scala:129)
at com.bah.ossem.spark.topic.LDACountVectorizer.main(LDACountVectorizer.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
article.json (line 1):
{"doc_id": 11, "sentiment_polarity": 0.223, "link_title": "Donald Trump will live-tweet 's Democratic Debate - Politics.com", "sentiment_subjectivity": 0.594, "url": "https://www.cnn.com/...", "text": "Watch the first Democratic presidential debate Tuesday...", "image": "http://i2.cdn.turner.com..."}
EDIT: Implemented zero323's Scala code in Java and it works great. Thank you, zero323!
Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered");
DataFrame jsondf = sqlContext.read().json("file:///home/ec2-user/spark_apps/article.json");
DataFrame wordsDataFrame = tokenizer.transform(jsondf);
DataFrame filteredTokens = remover.transform(wordsDataFrame);
filteredTokens.show();
CountVectorizerModel cvModel = new CountVectorizer()
.setInputCol("filtered").setOutputCol("features")
.setVocabSize(10).fit(filteredTokens);
cvModel.transform(filteredTokens).show();
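The root cause of the original error is that StopWordsRemover consumes a column of type ArrayType(StringType), i.e. already-tokenized text, while the "text" field in the JSON is a plain StringType; the Tokenizer stage supplies that missing split. The same tokenize-then-filter logic can be sketched in plain Java without Spark (the class name and the tiny stop-word list below are illustrative assumptions, not MLlib's built-in defaults):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordsSketch {
    public static void main(String[] args) {
        // Hypothetical stop-word list for illustration only;
        // StopWordsRemover ships with its own default list.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "first", "will"));

        // Tokenizer step: StringType -> ArrayType(StringType)
        // (lowercase and split on whitespace, as Tokenizer does).
        String text = "Watch the first Democratic presidential debate Tuesday";
        List<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+"));

        // StopWordsRemover step: filters the token array, not the raw string.
        List<String> filtered = tokens.stream()
                .filter(t -> !stopWords.contains(t))
                .collect(Collectors.toList());

        System.out.println(filtered);
        // [watch, democratic, presidential, debate, tuesday]
    }
}
```

This is why transform() failed on the raw DataFrame but succeeds on the Tokenizer's output: each stage's input column must already have the schema the stage declares.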
Yes, thank you. I assumed the input would be a string of tokens. Duh, I get it now. It would be nice to see this in Java, since I was fumbling with how to tokenize one column while still keeping the other data, which doesn't need tokenizing, in the DataFrame. –
Never mind. I implemented a Java version just like the Scala code above and it works perfectly; the DataFrame still keeps everything. I had thought the only column in the tokenized DataFrame would be the inputCol I specified. Thanks a lot. –
I saw your help on this question: http://stackoverflow.com/questions/33308586/convert-from-dataframe-to-javapairrddlong-vector Is that the right way to create the corpus for LDA? –