2015-05-18

Spark - creating an RDD of (label, features) pairs from a CSV file

I have a CSV file and want to run a simple LinearRegressionWithSGD on the data.

Sample data is shown below (the file has 99 rows in total, including the header), and the goal is to predict the y_3 variable:

y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8 
2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0 
2236.304347826087,17.0,1432.0,1.0,0.0,0.0,12.0 
2001.9512195121952,35.0,1432.0,0.0,1.0,0.0,5.0 
992.4324324324324,17.0,1430.0,1.0,0.0,0.0,12.0 
4386.666666666667,26.0,1430.0,0.0,0.0,1.0,25.0 
1335.9036144578313,17.0,1432.0,0.0,1.0,0.0,5.0 
1097.560975609756,17.0,1100.0,0.0,1.0,0.0,5.0 
3526.6666666666665,26.0,1432.0,0.0,1.0,0.0,12.0 
506.8421052631579,17.0,1430.0,1.0,0.0,0.0,5.0 
2095.890410958904,35.0,1430.0,1.0,0.0,0.0,12.0 
720.0,35.0,1430.0,1.0,0.0,0.0,5.0 
2416.5,17.0,1432.0,0.0,0.0,1.0,12.0 
3306.6666666666665,35.0,1800.0,0.0,0.0,1.0,12.0 
6105.974025974026,35.0,1800.0,1.0,0.0,0.0,25.0 
1400.4624277456646,35.0,1800.0,1.0,0.0,0.0,5.0 
1414.5454545454545,26.0,1430.0,1.0,0.0,0.0,12.0 
5204.68085106383,26.0,1800.0,0.0,0.0,1.0,25.0 
1812.2222222222222,17.0,1800.0,1.0,0.0,0.0,12.0 
2763.5928143712576,35.0,1100.0,1.0,0.0,0.0,12.0 

I read in the data with the following command:

val data = sc.textFile(datadir + "/data_2.csv"); 

But when I try to create an RDD of (label, features) pairs with the following command:

val parsedData = data.map { line => 
    val parts = line.split(',') 
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) 
    }.cache() 

an error appears in the output, so I cannot go on to train the model. Any help?

P.S. I am running Spark from the Scala IDE on Windows 7 x64.


You need to filter out the header. See also: http://stackoverflow.com/questions/24299427/how-do-i-convert-csv-file-to-rdd/24307475#24307475 – maasg


Thanks, I took the header out, but now when I use `val parsedData = data.map { line => val parts = line.split(',') ; LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) }.cache()`, I get the error "value split is not a member of Array[String]". Can you help me? – Mohammad


You are doing `parts(1).split(' ').map(_.toDouble)`; I am not sure why you split on spaces, since the input has none. Also, you have changed your code, and I do not see the error you mentioned above. –

Answers


After a lot of effort, I found the solution. The first problem was the header row, and the second was the mapping function. Here is the complete solution:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Read the file
val csv = sc.textFile(datadir + "/data_2.csv")

// Grab the header line
val header = csv.first

// Remove the header (keep every line that is not the header)
val data = csv.filter(_ != header)

// Create an RDD of (label, features) pairs: the first column is the
// label (y_3) and the remaining columns are the features
val parsedData = data.map { line =>
    val parts = line.split(',')
    LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()

I hope it saves you some time.
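With `parsedData` in place, the model the question was aiming for can be trained roughly as below. This is a hedged sketch, not part of the original answer: the iteration count and step size are illustrative, and SGD typically needs feature scaling to converge on data with ranges as different as these columns.

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Train the linear model (parameters are illustrative, not tuned)
val numIterations = 100
val stepSize = 1e-8  // SGD diverges easily on un-scaled features
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate on the training set: (actual label, predicted label) pairs
val valuesAndPreds = parsedData.map { point =>
  (point.label, model.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
println(s"training Mean Squared Error = $MSE")
```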


When you read in the file, the first line

y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8 

is read as well, so your `map` function ends up calling `toDouble` on the string "y_3". You need to filter out the first line and train on the remaining rows.
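The filter-then-parse step can be sketched in plain Scala (no Spark) with a couple of rows from the sample above, to show why keeping the header breaks `toDouble`:

```scala
// Minimal plain-Scala sketch of the same parse-after-filter logic.
// The values below are taken from the sample data in the question.
val lines = Seq(
  "y_3,x_6,x_7",            // header: "y_3".toDouble would throw here
  "2995.38,17.0,1800.0",
  "2236.30,17.0,1432.0"
)

val header = lines.head
val data = lines.filter(_ != header)  // drop the header row

// First column is the label, the rest are the features
val parsed = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  (parts.head, parts.tail.toVector)
}

println(parsed.head)  // (2995.38, Vector(17.0, 1800.0))
```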

