Is there any relationship between `numFeatures` in Spark MLlib's `HashingTF` and the actual number of terms in the documents (sentences)?
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

List<Row> data = Arrays.asList(
    RowFactory.create(0.0, "Hi I heard about Spark"),
    RowFactory.create(0.0, "I wish Java could use case classes"),
    RowFactory.create(1.0, "Logistic regression models are neat")
);
StructType schema = new StructType(new StructField[]{
    new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> sentenceData = spark.createDataFrame(data, schema);

Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> wordsData = tokenizer.transform(sentenceData);

int numFeatures = 20;
HashingTF hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
Dataset<Row> featurizedData = hashingTF.transform(wordsData);
As the Spark MLlib documentation says, HashingTF transforms each sentence into a feature vector of length numFeatures. What happens if a document (here, a sentence) contains thousands of terms? What should the value of numFeatures be, and how should it be calculated?
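To make the question concrete, here is a minimal, self-contained sketch of the hashing trick that HashingTF is based on: each term is mapped to an index via `hash(term) mod numFeatures`, so distinct terms can collide when numFeatures is smaller than the vocabulary. The class and method names (`HashingTrickSketch`, `termFrequencies`) are hypothetical, and `String.hashCode()` stands in for Spark's actual hash function (which differs); this only illustrates the collision behavior being asked about.

```java
import java.util.Arrays;
import java.util.List;

public class HashingTrickSketch {

    // Hypothetical illustration of the hashing trick: count each word into the
    // bucket hash(word) mod numFeatures. Collisions merge counts of distinct words.
    static int[] termFrequencies(List<String> words, int numFeatures) {
        int[] counts = new int[numFeatures];
        for (String w : words) {
            // floorMod keeps the index non-negative even if hashCode() is negative
            int bucket = Math.floorMod(w.hashCode(), numFeatures);
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hi", "i", "heard", "about", "spark");
        // With only 4 buckets, some of the 5 distinct terms must collide.
        System.out.println(Arrays.toString(termFrequencies(words, 4)));
        // With 64 buckets, collisions become unlikely for this tiny vocabulary.
        System.out.println(Arrays.toString(termFrequencies(words, 64)));
    }
}
```

Note that regardless of collisions, the bucket counts always sum to the number of tokens; collisions only blur *which* term a count belongs to, which is why numFeatures is usually chosen comfortably larger than the expected vocabulary size.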