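How can I stop a Dataset from renaming columns to value when mapping?

Whenever I map over a Dataset, the columns get renamed from _1, _2, etc. to value, value. How can I stop the Dataset from renaming columns to value when mapping?

What is causing the rename?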

2017-08-31 user3920235

Example code would help :)

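This happens because calling map on a Dataset makes Spark serialize and deserialize the query's data.

To serialize it, Spark needs an Encoder. That is why there is an ExpressionEncoder object with an apply method. Its doc comment says: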

    A factory for constructing encoders that convert objects and primitives to and from the
    internal row format using catalyst expressions and code generation. By default, the
    expressions used to retrieve values from an input row when producing an object will be
    created as follows:
    - Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions
      and [[UnresolvedExtractValue]] expressions.
    - Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
    - Primitives will have their values extracted from the first ordinal with a schema that defaults
      to the name `value`.

Look at the last point. Your query maps only to primitives, so Catalyst uses the name `value`.
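A minimal sketch of the behaviour described above, written for spark-shell or a Scala script with Spark on the classpath. The local session settings and the sample data are assumptions made for illustration; they are not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("value-column-demo")
  .getOrCreate()
import spark.implicits._

// A Dataset of tuples: the tuple encoder names the columns _1 and _2.
val pairs = Seq((1, "a"), (2, "b")).toDS()

// Mapping to a single primitive switches to a primitive encoder,
// whose schema has one column that defaults to the name `value`.
val firsts = pairs.map(_._1)

// Mapping to a tuple again keeps the positional names.
val swapped = pairs.map(p => (p._2, p._1))

println(pairs.columns.mkString(", "))   // _1, _2
println(firsts.columns.mkString(", "))  // value
println(swapped.columns.mkString(", ")) // _1, _2
```

The column names come from the encoder of the map result type, not from the input Dataset, which is why the original names do not survive the map.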

If you add .select('value.as("MyPropertyName")).as[CaseClass], the field name will be correct.
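The rename can be sketched as follows. `Person`, `RenameValueColumn`, and `toPeople` are hypothetical names introduced for this example, and `$"value"` is used as an equivalent of the `'value` symbol syntax. The case class is declared at the top level so that Spark can derive an encoder for it in a compiled program.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class; its field name is the column name we want.
case class Person(name: String)

object RenameValueColumn {
  // Renames the default `value` column, then converts back to a typed Dataset.
  def toPeople(spark: SparkSession, names: Dataset[String]): Dataset[Person] = {
    import spark.implicits._
    names.select($"value".as("name")).as[Person]
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("rename-value-column")
      .getOrCreate()
    import spark.implicits._

    // Mapping a tuple Dataset to a String yields a single column named `value`.
    val names = Seq(("alice", 30), ("bob", 25)).toDS().map(_._1)
    println(names.columns.mkString(", "))                  // value
    println(toPeople(spark, names).columns.mkString(", ")) // name

    spark.stop()
  }
}
```

The alias has to match the case class field name, because `as[Person]` resolves fields by name.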

Types that will get the column name `value`:

Option(_)
Array
Collection types such as Seq and Map
Types such as String, Timestamp, Date, BigDecimal
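A quick way to check a few of the types listed above is to build a Dataset of each and look at its column names. This sketch assumes a reasonably recent Spark version in which `spark.implicits._` provides encoders for Seq, Map, and Array; the session settings and sample values are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("value-column-types")
  .getOrCreate()
import spark.implicits._

// Each of these is a Dataset of a single non-case-class, non-tuple type.
val strings = Seq("a", "b").toDS()
val seqs    = Seq(Seq(1, 2), Seq(3)).toDS()
val maps    = Seq(Map("k" -> 1)).toDS()
val arrays  = Seq(Array(1, 2)).toDS()

// Print the schema of each Dataset to see the name of its single column.
strings.printSchema()
seqs.printSchema()
maps.printSchema()
arrays.printSchema()
```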


2017-08-31 15:31:20
