爲每個數據點創建單獨的地圖並使用地圖轉換爲雙倍值。
def createMap(data: RDD[String]) : Map[String,Double] = {
var mapData:Map[String,Double] = Map()
var counter = 0.0
data.collect().foreach{ item =>
counter = counter +1
mapData += (item -> counter)
}
mapData
}
def getLablelValue(input: String): Int = input match {
case "<=50K" => 0
case ">50K" => 1
}
val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct
val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)
val featureVector = census.map{line =>
val fields = line.split(", ")
LabeledPoint(getLablelValue(fields(14).toString) , Vectors.dense(fields(0).toDouble, orgTypeMap(fields(1).toString) , fields(2).toDouble , gradeTypeMap(fields(3).toString) , fields(4).toDouble , marStatusMap(fields(5).toString), jobTypeMap(fields(6).toString), familyStatusMap(fields(7).toString),raceTypeMap(fields(8).toString),genderTypeMap (fields(9).toString), fields(10).toDouble , fields(11).toDouble , fields(12).toDouble,countryMap(fields(13).toString) , salaryRangeMap(fields(14).toString)))
}
是的,我肯定會考慮扔掉那個類別,因爲你注意到的第三個數字的問題與我所做的是一樣的。我只是想知道是否有辦法將它或其他類別轉化爲有意義但不具有過度影響力的價值。我曾考慮過縮小尺寸,但我可能不得不試驗一下,看哪些是有意義的,哪些不是。 – Baldier
對此的+1:「直接使用k-NN不會工作,因爲多個不相似的維度之間的」距離「的想法沒有明確定義」:) – bendaizer