Spark - creating an RDD of (label, features) pairs from a CSV file
I have a CSV file and would like to run a simple LinearRegressionWithSGD on the data. Sample data from the file is shown below (the file has 99 rows in total, including the header); the goal is to predict the y_3 variable:
y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0
2236.304347826087,17.0,1432.0,1.0,0.0,0.0,12.0
2001.9512195121952,35.0,1432.0,0.0,1.0,0.0,5.0
992.4324324324324,17.0,1430.0,1.0,0.0,0.0,12.0
4386.666666666667,26.0,1430.0,0.0,0.0,1.0,25.0
1335.9036144578313,17.0,1432.0,0.0,1.0,0.0,5.0
1097.560975609756,17.0,1100.0,0.0,1.0,0.0,5.0
3526.6666666666665,26.0,1432.0,0.0,1.0,0.0,12.0
506.8421052631579,17.0,1430.0,1.0,0.0,0.0,5.0
2095.890410958904,35.0,1430.0,1.0,0.0,0.0,12.0
720.0,35.0,1430.0,1.0,0.0,0.0,5.0
2416.5,17.0,1432.0,0.0,0.0,1.0,12.0
3306.6666666666665,35.0,1800.0,0.0,0.0,1.0,12.0
6105.974025974026,35.0,1800.0,1.0,0.0,0.0,25.0
1400.4624277456646,35.0,1800.0,1.0,0.0,0.0,5.0
1414.5454545454545,26.0,1430.0,1.0,0.0,0.0,12.0
5204.68085106383,26.0,1800.0,0.0,0.0,1.0,25.0
1812.2222222222222,17.0,1800.0,1.0,0.0,0.0,12.0
2763.5928143712576,35.0,1100.0,1.0,0.0,0.0,12.0
I have read in the data with the following command:
val data = sc.textFile(datadir + "/data_2.csv");
Then I tried to create the RDD of (label, features) pairs with the following command:
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
but I cannot get past this point to train the model. Any help?
P.S. I am running Spark with the Scala IDE on Windows 7 x64.
You need to filter out the header. See also: http://stackoverflow.com/questions/24299427/how-do-i-convert-csv-file-to-rdd/24307475#24307475 – maasg
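A sketch of the fix maasg suggests, combined with parsing every feature column instead of only `parts(1)` (Spark mllib-era API; the `numIterations` value of 100 is an arbitrary, untuned choice, and `sc`/`datadir` are assumed to be defined as in the question):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val data = sc.textFile(datadir + "/data_2.csv")

// Drop the header row (assumes the header is the first line of the file).
val header = data.first()
val parsedData = data.filter(_ != header).map { line =>
  val parts = line.split(',')
  // Label is the first column (y_3); features are all remaining columns.
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()

// Train the model; 100 iterations is just a starting point.
val model = LinearRegressionWithSGD.train(parsedData, 100)
```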
Thanks, I removed the header, but now when I use `val parsedData = data.map { line => val parts = line.split(','); LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) }.cache()`, I get the error "value split is not a member of Array[String]". Can you help me? – Mohammad
You are doing `parts(1).split(' ').map(_.toDouble)`; I am not sure why you want to split on spaces, since the input contains none. Also, you have changed your code, and I do not see how it would produce the error you mention. –
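To see why splitting `parts(1)` on spaces is wrong, here is the parsing step on one sample line from the file, without Spark: the label is column 0, and the features should come from every remaining column via `parts.tail`, not from splitting a single field.

```scala
// One data line from the CSV sample above.
val line = "2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0"

val parts = line.split(',')
val label = parts(0).toDouble            // y_3 column
val features = parts.tail.map(_.toDouble) // x_6 .. x_8 columns

// features = Array(17.0, 1800.0, 0.0, 1.0, 0.0, 12.0)
```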