I have the following code. How do I correctly build TF-IDF sentence vectors with Java in Apache Spark?
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.feature.IDFModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.sql.SparkSession;

public class TfIdfExample {
    public static void main(String[] args) {
        // SparkSingleton and KMeansProcessor are this project's own helper classes.
        JavaSparkContext sc = SparkSingleton.getContext();
        SparkSession spark = SparkSession.builder()
                .config("spark.sql.warehouse.dir", "spark-warehouse")
                .getOrCreate();
        JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
                Arrays.asList("this is a sentence".split(" ")),
                Arrays.asList("this is another sentence".split(" ")),
                Arrays.asList("this is still a sentence".split(" "))), 2);
        documents.cache();
        HashingTF hashingTF = new HashingTF(); // default: 2^20 = 1048576 features
        JavaRDD<Vector> featurizedData = hashingTF.transform(documents);
        // alternatively, CountVectorizer can also be used to get term frequency vectors
        featurizedData.cache(); // cache before fit(): the RDD is reused by transform()
        IDF idf = new IDF();
        IDFModel idfModel = idf.fit(featurizedData);
        JavaRDD<Vector> tfidfs = idfModel.transform(featurizedData);
        System.out.println(tfidfs.collect());
        KMeansProcessor kMeansProcessor = new KMeansProcessor();
        JavaPairRDD<Vector, Integer> result = kMeansProcessor.Process(tfidfs);
        result.collect().forEach(System.out::println);
    }
}
I need the vectors for k-means, but instead I keep getting vectors like these:
[(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])]
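The zeros in these vectors are expected: spark.mllib's IDF is computed as idf(t) = log((m + 1) / (df(t) + 1)), so any term that appears in all m = 3 documents here ("this", "is", "sentence") gets idf = log(4/4) = 0, and its tf-idf value is 0.0. A plain-Java check of the three weights that appear above (a sketch of the formula only, no Spark required):

```java
public class IdfWeights {
    public static void main(String[] args) {
        int m = 3; // number of documents
        // df = 3: "this", "is", "sentence" appear in every document
        System.out.println(Math.log((m + 1.0) / (3 + 1.0))); // 0.0
        // df = 2: "a" appears in two of the three documents
        System.out.println(Math.log((m + 1.0) / (2 + 1.0))); // 0.28768207245178085
        // df = 1: "another" and "still" appear in one document each
        System.out.println(Math.log((m + 1.0) / (1 + 1.0))); // 0.6931471805599453
    }
}
```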
and after k-means runs, I get this:
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),0)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),0)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),0)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
But I don't think it is working correctly, because the tf-idf result should look different. I assume mllib already has a ready-made method for this, but I tried the documentation example and did not get what I need, and I could not find a custom Spark solution. Has anyone worked with this who can tell me what I am doing wrong? Am I perhaps using the mllib functions incorrectly?
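For reference, the DataFrame-based spark.ml API is the documented successor to the RDD-based spark.mllib pipeline above. A minimal sketch of the same TF-IDF computation with it (assuming Spark 2.x on the classpath; the column names and the small numFeatures value are illustrative choices, not requirements):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class MlTfIdfSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]").appName("tfidf").getOrCreate();

        List<Row> data = Arrays.asList(
                RowFactory.create("this is a sentence"),
                RowFactory.create("this is another sentence"),
                RowFactory.create("this is still a sentence"));
        StructType schema = new StructType(new StructField[]{
                new StructField("sentence", DataTypes.StringType, false, Metadata.empty())});
        Dataset<Row> docs = spark.createDataFrame(data, schema);

        // Split each sentence into words.
        Dataset<Row> words = new Tokenizer()
                .setInputCol("sentence").setOutputCol("words").transform(docs);

        // A small feature space keeps the vectors readable, unlike the 2^20 default.
        Dataset<Row> tf = new HashingTF()
                .setInputCol("words").setOutputCol("rawFeatures")
                .setNumFeatures(32).transform(words);

        IDFModel idfModel = new IDF()
                .setInputCol("rawFeatures").setOutputCol("features").fit(tf);
        idfModel.transform(tf).select("features").show(false);

        spark.stop();
    }
}
```

The resulting "features" column holds the tf-idf vectors that a spark.ml KMeans stage can consume directly via setFeaturesCol("features").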
Thanks, but what do you mean? I assume you thought the printout was truncated; I copy-pasted all of it from the console. I think tf-idf is not giving me real vectors. When I do `new HashingTF(32);`, the indices in the first part of each tuple become smaller, but I still don't understand why I get 0.0 for some of the values in the second part. –
I ran your example, and those values actually should be zero. I've added more details and explanation links — let me know if it helps. –
One question about this vector: in `(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0])`, is `[0.28768207245178085,0.0,0.0,0.0]` the tf-idf, i.e. IDF applied after TF? –
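On the ordering question in the last comment: yes, `IDFModel.transform` multiplies each raw term-frequency count by the fitted IDF weight, i.e. tf-idf = tf × idf. Since every term occurs exactly once per sentence here, tf = 1 and the stored value is just the idf weight, which is why 0.28768207245178085 appears unchanged. A minimal plain-Java sketch, assuming spark.mllib's formula idf(t) = log((m + 1) / (df(t) + 1)):

```java
public class TfIdfProduct {
    public static void main(String[] args) {
        int m = 3;       // number of documents
        double tf = 1.0; // each term occurs once in its sentence
        double idf = Math.log((m + 1.0) / (2 + 1.0)); // "a": df = 2
        System.out.println(tf * idf); // 0.28768207245178085
    }
}
```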