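Overloaded method value corr with alternatives
I want to compute the correlation between two features, each read from a separate text file, as shown below.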
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.stat.Statistics
import scala.io.Source
object Corr {
  def main() {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("Correlation")
      .getOrCreate()
    val sc = sparkSession.sparkContext
    val feature_1 = Source.fromFile("feature_1.txt").getLines.toArray
    val feature_2 = Source.fromFile("feature_2.txt").getLines.toArray
    val feature_1_dist = sc.parallelize(feature_1)
    val feature_2_dist = sc.parallelize(feature_2)
    val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
    println(s"Correlation is: $correlation")
  }
}
Corr.main()
However, I get the following error:
overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double],method: String)scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double],method: String)scala.Double
cannot be applied to (org.apache.spark.rdd.RDD[String], org.apache.spark.rdd.RDD[String], String)
val correlation: Double = Statistics.corr(feature_1_dist, feature_2_dist, "pearson")
What I am trying to do looks very similar to the example here, but I can't figure out what is wrong.
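The error message lists only two overloads of `Statistics.corr`, both taking collections of doubles, while `getLines` yields strings, so `sc.parallelize` produces an `RDD[String]`. A minimal sketch of one possible fix, assuming each non-blank line of the files holds a single number (the object name `CorrFixed` and the helper `parseFeature` are hypothetical names introduced here):

```scala
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.SparkSession
import scala.io.Source

object CorrFixed {
  // Hypothetical helper: convert raw lines to doubles.
  // Assumes one numeric value per line; blank lines are skipped.
  def parseFeature(lines: Seq[String]): Seq[Double] =
    lines.map(_.trim).filter(_.nonEmpty).map(_.toDouble)

  // Read a whole file on the driver and close it afterwards.
  def readFeature(path: String): Seq[Double] = {
    val src = Source.fromFile(path)
    try parseFeature(src.getLines().toList) finally src.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("Correlation")
      .getOrCreate()
    val sc = spark.sparkContext

    // Both RDDs are now RDD[Double], which matches the Scala overload of corr.
    val feature1 = sc.parallelize(readFeature("feature_1.txt"))
    val feature2 = sc.parallelize(readFeature("feature_2.txt"))

    val correlation: Double = Statistics.corr(feature1, feature2, "pearson")
    println(s"Correlation is: $correlation")

    spark.stop()
  }
}
```

Both files need to contain the same number of values, since the two series are paired element by element.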
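What is `Statistics`? Can you add your imports? The code posted above has no `import` statements. – Paul
@Paul Sorry about that. I've just added them. –
Reading the features into an array held on the master is already a questionable move; if the features are large, this should not happen. – Paul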
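To illustrate the last comment, here is a hedged sketch that lets Spark read the files itself with `sc.textFile` instead of collecting them into a driver-side array. It assumes every line holds exactly one number and that both files have the same number of lines. Because two independently loaded files need not share the same partition layout, the sketch pairs values by line index and uses the `RDD[Vector]` overload of `Statistics.corr`, which returns a correlation matrix. The names `CorrDistributed` and `pairByLine` are hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CorrDistributed {
  // Hypothetical helper: pair the i-th value of `a` with the i-th value of `b`
  // by joining on the line index, so differing partition layouts do not matter.
  def pairByLine(a: RDD[Double], b: RDD[Double]): RDD[(Double, Double)] = {
    val aByIndex = a.zipWithIndex.map(_.swap)
    val bByIndex = b.zipWithIndex.map(_.swap)
    aByIndex.join(bByIndex).values
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("Correlation")
      .getOrCreate()
    val sc = spark.sparkContext

    // Assumption: one numeric value per line, no blank lines.
    val feature1 = sc.textFile("feature_1.txt").map(_.trim.toDouble)
    val feature2 = sc.textFile("feature_2.txt").map(_.trim.toDouble)

    // Each row is a two-column vector (feature1, feature2).
    val rows = pairByLine(feature1, feature2).map { case (x, y) => Vectors.dense(x, y) }

    // Entry (0, 1) of the correlation matrix is the correlation between the two columns.
    val correlation: Double = Statistics.corr(rows, "pearson")(0, 1)
    println(s"Correlation is: $correlation")

    spark.stop()
  }
}
```

The join adds a shuffle, which is the price of not relying on both files being split into identical partitions.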