CSV to array of objects

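I am using a third-party package for Spark that works with "PointFeature" objects. I am trying to take a CSV file and put the elements of each row into an Array of these PointFeature objects.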

The PointFeature constructor for my implementation looks like this:

Feature(Point(_c1, _c2), _c3)

where _c1, _c2 and _c3 are the columns of my CSV and represent doubles.

Here is my current attempt:

val points: Array[PointFeature[Double]] = for{ 
    line <- sc.textFile("file.csv") 
    point <- Feature(Point(line._c1,line._c2),line._c3) 
} yield point

The errors show up where I reference the columns:

<console>:36: error: value _c1 is not a member of String 
    point <- Feature(Point(line._c1,line._c2),line._c3.toDouble) 
          ^
<console>:36: error: value _c2 is not a member of String 
     point <- Feature(Point(line._c1,line._c2),line._c3.toDouble) 
              ^
<console>:36: error: value _c3 is not a member of String 
     point <- Feature(Point(line._c1,line._c2),line._c3.toDouble) 
                ^

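This is obviously because I am referencing a String as if it were an element of a DataFrame. I am wondering whether there is a way to work with DataFrames in this for-loop format, or a way to split each line into a List of doubles. Maybe I need an RDD? Also, I am not certain this will yield an Array. Actually, I suspect it will return an RDD...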

I am using Spark 2.1.0 on Amazon EMR.

Here are some other questions I have drawn from:

How to read csv file into an Array of arrays in scala

Splitting strings in Apache Spark using Scala

How to iterate records spark scala?

Source

2017-04-25 user306603

You can build a Dataset[Feature] like this:

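import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._ // needed for the 'x column syntax and for .as[Feature]

case class Feature(x: Double, y: Double, z: Double)

sparkSession.read.csv("file.csv")
    .toDF("x", "y", "z")
    .withColumn("x", 'x.cast(DoubleType))
    .withColumn("y", 'y.cast(DoubleType))
    .withColumn("z", 'z.cast(DoubleType))
    .as[Feature]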

Then you can consume your strongly typed Dataset[Feature] as you see fit.
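
To get from that Dataset to the Array the question asks for, one option is to collect the rows to the driver and map each one onto the package's constructor. The sketch below is only an illustration under stated assumptions: CsvRow is a hypothetical name for the case class above (renamed so it does not clash with the package's own Feature), and Point, PointFeature and Feature are local stand-ins that merely mimic the Feature(Point(_c1, _c2), _c3) shape from the question; in real code they would come from the third-party package. Note that collect() pulls every row onto the driver, so this suits only data that fits in memory.

```scala
// Hypothetical row type playing the role of the answer's case class,
// renamed so it does not clash with the package's Feature.
case class CsvRow(x: Double, y: Double, z: Double)

// Local stand-ins for the third-party types (assumption: the real ones
// come from the package and may be defined differently).
case class Point(x: Double, y: Double)
case class PointFeature[D](geom: Point, data: D)
object Feature {
  def apply[D](geom: Point, data: D): PointFeature[D] = PointFeature(geom, data)
}

object DatasetToArray {
  // Maps rows that have already been collected onto the constructor shape
  // shown in the question.
  def toPointFeatures(rows: Array[CsvRow]): Array[PointFeature[Double]] =
    rows.map(r => Feature(Point(r.x, r.y), r.z))
}

// With Spark, assuming ds is the Dataset[CsvRow] built as above:
//   val points: Array[PointFeature[Double]] =
//     DatasetToArray.toPointFeatures(ds.collect())
```

If the result is too large for the driver, the same mapping could instead be applied inside ds.rdd.map, which yields an RDD rather than an Array.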

Source

2017-04-25 22:47:23 Vidya

I would suggest doing this in smaller steps.

Step One

Get your lines as an Array/List/whatever of Strings.

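val lines = sc.textFile("file.txt"), or something like that. (textFile already returns an RDD[String] with one element per line, so no getLines call is needed.)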

Step Two

Break your lines into their own lists of columns.

val splits = lines.map {l => l.split(",")}

Step Three

Extract your columns as vals that you can use:

val res = splits.map { 
    case Array(col1, col2, col3) => // Convert to doubles, put into the Feature/Point structure
    case _ => // Handle the case where your csv is malformatted 
}

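This can all be done in one go; I have only split it up to show the logical steps: file → list of Strings → list of lists of Strings → list of Features.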
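
Putting the three steps together, a minimal sketch might look like the following. It runs on a plain Seq[String] so it can be tried without a cluster; the same parseLine function could be passed to flatMap on the RDD[String] that sc.textFile returns, followed by collect() to obtain an Array. Point, PointFeature and Feature are local stand-ins that mimic the constructor shape from the question, and parseLine and parseAll are hypothetical helper names. Malformed rows are silently dropped here, which is only one possible policy.

```scala
import scala.util.Try

// Local stand-ins for the third-party types (assumption, not the
// package's real definitions).
case class Point(x: Double, y: Double)
case class PointFeature[D](geom: Point, data: D)
object Feature {
  def apply[D](geom: Point, data: D): PointFeature[D] = PointFeature(geom, data)
}

object CsvToFeatures {
  // Hypothetical helper: one CSV line becomes Some(feature), or None
  // when the line is malformed.
  def parseLine(line: String): Option[PointFeature[Double]] =
    line.split(",").map(_.trim) match {
      case Array(c1, c2, c3) =>
        // Try catches the NumberFormatException from a non-numeric field.
        Try(Feature(Point(c1.toDouble, c2.toDouble), c3.toDouble)).toOption
      case _ =>
        None // wrong number of columns
    }

  // Parses every line and keeps only the rows that converted cleanly.
  def parseAll(lines: Seq[String]): Array[PointFeature[Double]] =
    lines.flatMap(l => parseLine(l)).toArray
}
```

Returning an Option from parseLine keeps the decision about bad rows in one place; logging or failing fast instead would only change the two fallback branches.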

Source

2017-04-25 23:10:00 Charles
