2017-08-31

Answer

This happens because the map on the Dataset forces the query to be serialized and deserialized in Spark.

To serialize it, Spark has to know the Encoder. That is why there is an object ExpressionEncoder with an apply method. Its documentation comment says:

    A factory for constructing encoders that convert objects and primitives to and from the
    internal row format using catalyst expressions and code generation. By default, the
    expressions used to retrieve values from an input row when producing an object will be created as
    follows:
    - Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions
      and [[UnresolvedExtractValue]] expressions.
    - Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
    - Primitives will have their values extracted from the first ordinal with a schema that defaults
      to the name `value`.
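
The three cases in that comment can be observed by asking an encoder for its schema. The following is a minimal sketch, assuming Spark SQL is on the classpath; it uses the public `Encoders` factory rather than `ExpressionEncoder` directly, and the `Person` case class and `EncoderSchemas` object are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.sql.Encoders

// Hypothetical case class, used only to illustrate the "Classes" case.
case class Person(name: String, age: Int)

object EncoderSchemas {
  def main(args: Array[String]): Unit = {
    // Classes: field names come from the case class members.
    val classFields = Encoders.product[Person].schema.fieldNames
    // Tuples: fields are addressed by position.
    val tupleFields = Encoders.tuple(Encoders.scalaInt, Encoders.STRING).schema.fieldNames
    // Primitives: a single column with the default name.
    val primitiveFields = Encoders.scalaInt.schema.fieldNames

    println(classFields.mkString(", "))
    println(tupleFields.mkString(", "))
    println(primitiveFields.mkString(", "))
  }
}
```

No SparkSession is needed here, because an encoder's schema is available without running a query.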

Look at the last point. Your query maps only to a primitive, so Catalyst uses the name `value`.

If you add .select('value.as("MyPropertyName")).as[CaseClass], the field name will be correct.
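
A minimal sketch of that rename, assuming a local SparkSession. The case class `Renamed`, the object name, and the sample data are hypothetical; `$"value"` is used in place of the symbol literal `'value` because symbol literals are deprecated in newer Scala versions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical target type; its field name must match the alias used below.
case class Renamed(MyPropertyName: Int)

object RenameValueColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rename-value-column")
      .getOrCreate()
    import spark.implicits._

    // Mapping to a primitive goes through the primitive encoder,
    // so the resulting Dataset has a single column named `value`.
    val mapped = Seq(1, 2, 3).toDS().map(_ + 1)
    mapped.printSchema()

    // Alias the column, then convert to the case class.
    val renamed = mapped.select($"value".as("MyPropertyName")).as[Renamed]
    renamed.printSchema()

    spark.stop()
  }
}
```

The case class is declared at the top level rather than inside `main`, since Spark derives its encoder from the type at compile time.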

Types that will have the column name `value`:

  • Option(_)
  • Array
  • Collection types such as Seq, Map
  • Types such as String, Timestamp, Date, BigDecimal
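
To check this for a few of the listed types, one can print the column names of small Datasets. This is a sketch under the assumption of a local SparkSession with `spark.implicits._` in scope; the object name and sample values are made up for illustration, and only a subset of the listed types is shown.

```scala
import org.apache.spark.sql.SparkSession

object ValueColumnTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-column-types")
      .getOrCreate()
    import spark.implicits._

    // Each Dataset below holds a non-case-class element type.
    val strings = Seq("a", "b").toDS()
    val seqs    = Seq(Seq(1, 2), Seq(3)).toDS()
    val arrays  = Seq(Array(1, 2), Array(3)).toDS()

    // Print the column names so they can be compared with the list above.
    println(strings.columns.mkString(", "))
    println(seqs.columns.mkString(", "))
    println(arrays.columns.mkString(", "))

    spark.stop()
  }
}
```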