2016-03-07 62 views

Spark 1.6 Pearson correlation

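If I have a dataset and want to use the Pearson correlation coefficient to identify the features with the greatest predictive power, what tool should I use?

The naive approach I used is: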

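// Candidate feature columns: everything except id, maxcykle and rul.
val columns = x.columns.toList.filterNot(Seq("id", "maxcykle", "rul").contains)
// One stat.corr call (and hence one Spark job) per column.
val corrVithRul = columns.map(c => (c, x.stat.corr("rul", c, "pearson")))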

Output: 

    columns: List[String] = List(cykle, setting1, setting2, setting3, s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11, s12, s13, s14, s15, s16, s17, s18, s19, s20, s21, label1, label2, a1, sd1, a2, sd2, a3, sd3, a4, sd4, a5, sd5, a6, sd6, a7, sd7, a8, sd8, a9, sd9, a10, sd10, a11, sd11, a12, sd12, a13, sd13, a14, sd14, a15, sd15, a16, sd16, a17, sd17, a18, sd18, a19, sd19, a20, sd20, a21, sd21) 
    corrVithRul: List[(String, Double)] = List((cykle,-0.7362405993070199), (setting1,-0.0031984575547410617), (setting2,-0.001947628351500473), (setting3,NaN), (s1,-0.011460304217886725), (s2,-0.6064839743782909), (s3,-0.5845203909175897), (s4,-0.6789482333860454), (s5,-0.011121400898477964), (s6,-0.1283484484732187), (s7,0.6572226620548292), (s8,-0.5639684065744165), (s9,-0.3901015749180319), (s10,-0.04924720421765515), (s11,-0.6962281014554186), (s12,0.6719831036132922), (s13,-0.5625688251505582), (s14,-0.30676887025759053), (s15,-0.6426670441973734), (s16,-0.09716223410021836), (s17,-0.6061535537829589), (s18,NaN), (s19,NaN), (s20,0.6294284994377392), (s21,0.6356620421802835), (label1,-0.5665958821050425), (label2,-0.548191636440298), (a1,0.040592887198906136), (sd1,NaN), (a2,-0.7364292... 

This of course submits one job per iteration of the map. Is Statistics.corr perhaps what I am looking for?

Answer


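Statistics.corr looks like the right choice here. Other options you might consider are RowMatrix.columnSimilarities (cosine similarities between columns, optionally with an optimized version that uses sampling with a threshold) and RowMatrix.computeCovariance. Either way, you first have to assemble your data into Vectors. Assuming the columns are already of DoubleType, you can use VectorAssembler: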

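import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???

// Assemble every column except id, maxcykle and rul into one vector column.
val assembler = new VectorAssembler()
    .setInputCols(df.columns.diff(Seq("id","maxcykle","rul")))
    .setOutputCol("features")

// Extract the assembled vectors as an RDD[Vector] for the mllib statistics API.
val rows = assembler.transform(df)
    .select("features")
    .rdd
    .map(_.getAs[Vector]("features"))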

Next, you can use Statistics.corr:

import org.apache.spark.mllib.stat.Statistics 

Statistics.corr(rows) 

or convert to a RowMatrix:

import org.apache.spark.mllib.linalg.distributed.RowMatrix 

val mat = new RowMatrix(rows) 

mat.columnSimilarities(0.75) 
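
Statistics.corr returns a plain Matrix without column names, so the entries have to be matched back to columns by index. The following is a minimal sketch of one way to do that, assuming the target column is assembled as the first vector component so that row 0 holds its correlations with every feature. The helper names `labelTargetRow` and `corrWithTarget` are hypothetical and not part of Spark.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.DataFrame

// Hypothetical helper: pairs each feature name with its entry in row 0,
// assuming the target was assembled as component 0 of every vector.
def labelTargetRow(features: Seq[String], corr: Matrix): Seq[(String, Double)] =
  features.zipWithIndex.map { case (name, i) => (name, corr(0, i + 1)) }

// Hypothetical helper: computes the full correlation matrix once
// and returns only the target's correlations, labelled by column name.
def corrWithTarget(df: DataFrame, target: String, excluded: Seq[String]): Seq[(String, Double)] = {
  val features = df.columns.diff(excluded :+ target)
  val assembler = new VectorAssembler()
    .setInputCols(target +: features) // target first, then the features
    .setOutputCol("features")
  val rows = assembler.transform(df)
    .select("features")
    .rdd
    .map(_.getAs[Vector]("features"))
  labelTargetRow(features, Statistics.corr(rows, "pearson"))
}
```

With the question's DataFrame this would be called as corrWithTarget(x, "rul", Seq("id", "maxcykle")). It computes the whole matrix in one pass over the data instead of issuing one stat.corr call per column.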

I am not sure how to interpret the matrix returned by Statistics.corr if column "x" is my predictor/target column – oluies


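Unfortunately, you will have to look things up by index, so put the dependent variable first. Otherwise, it is just the correlation matrix you would expect. – zero323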


Maybe the data should be normalized before computing the correlation? – oluies
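
On the normalization question: the Pearson coefficient is unchanged by shifting either variable or rescaling it by a positive factor, so standardizing the columns first should not change the result beyond floating-point rounding. The plain-Scala sketch below illustrates this on made-up sample data. The helper `pearson` is illustrative only and is not a Spark API.

```scala
// Illustrative helper (not a Spark API): Pearson correlation of two equal-length samples.
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.size == ys.size && xs.nonEmpty, "samples must be non-empty and equal in length")
  val mx = xs.sum / xs.size
  val my = ys.sum / ys.size
  val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
  val varX = xs.map(x => (x - mx) * (x - mx)).sum
  val varY = ys.map(y => (y - my) * (y - my)).sum
  cov / math.sqrt(varX * varY)
}

// Made-up sample data, for illustration only.
val a = Seq(1.0, 2.0, 4.0, 7.0)
val b = Seq(3.0, 1.0, 6.0, 8.0)

// Standardize a to zero mean and unit (population) standard deviation.
val meanA = a.sum / a.size
val sdA = math.sqrt(a.map(v => (v - meanA) * (v - meanA)).sum / a.size)
val aStd = a.map(v => (v - meanA) / sdA)

val raw = pearson(a, b)
val standardized = pearson(aStd, b) // equal to raw up to floating-point rounding

// A constant column has zero variance, so the ratio is 0.0 / 0.0, i.e. NaN.
val constant = pearson(Seq(5.0, 5.0, 5.0, 5.0), b)
```

The last line may also be what produces the NaN entries in the question's output: a column that never varies has no defined Pearson correlation with anything.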