如何在Spark DataFrame中添加常量列？

我想添加一個DataFrame與一些任意值（這是每行相同）的列。我得到一個錯誤，當我使用withColumn如下：如何在Spark DataFrame中添加常量列？

dt.withColumn('new_column', 10).head(5) 

--------------------------------------------------------------------------- 
AttributeError       Traceback (most recent call last) 
<ipython-input-50-a6d0257ca2be> in <module>() 
     1 dt = (messages 
     2  .select(messages.fromuserid, messages.messagetype, floor(messages.datetime/(1000*60*5)).alias("dt"))) 
----> 3 dt.withColumn('new_column', 10).head(5) 

/Users/evanzamir/spark-1.4.1/python/pyspark/sql/dataframe.pyc in withColumn(self, colName, col) 
    1166   [Row(age=2, name=u'Alice', age2=4), Row(age=5, name=u'Bob', age2=7)] 
    1167   """ 
-> 1168   return self.select('*', col.alias(colName)) 
    1169 
    1170  @ignore_unicode_prefix 

AttributeError: 'int' object has no attribute 'alias'

看來我可以欺騙功能進入工作，因爲我想通過加法和減法其他支柱之一（這樣它們添加到零），然後加入我想要的數字（在這種情況下爲10）：

dt.withColumn('new_column', dt.messagetype - dt.messagetype + 10).head(5) 

[Row(fromuserid=425, messagetype=1, dt=4809600.0, new_column=10), 
Row(fromuserid=47019141, messagetype=1, dt=4809600.0, new_column=10), 
Row(fromuserid=49746356, messagetype=1, dt=4809600.0, new_column=10), 
Row(fromuserid=93506471, messagetype=1, dt=4809600.0, new_column=10), 
Row(fromuserid=80488242, messagetype=1, dt=4809600.0, new_column=10)]

這是至高無上的，對嗎？我認爲有更合理的方式來做到這一點？

來源

2015-09-25 Evan Zamir

101

火花2.2+

火花2.2介紹typedLit支持Seq，Map，和Tuples（SPARK-19254）和下面的調用應支持（斯卡拉）：

import org.apache.spark.sql.functions.typedLit 

df.withColumn("some_array", typedLit(Seq(1, 2, 3))) 
df.withColumn("some_struct", typedLit(("foo", 1, .0.3))) 
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))

火花1.3+ （3210），1.4+（array,struct），2.0+（map）：

爲DataFrame.withColumn第二個參數應該是一個Column所以你必須使用文字：

from pyspark.sql.functions import lit 

df.withColumn('new_column', lit(10))

如果需要複雜的欄，你可以建造這些使用塊像array：

from pyspark.sql.functions import array, create_map, struct 

df.withColumn("some_array", array(lit(1), lit(2), lit(3))) 
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(.3))) 
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))

在Scala中可以使用完全相同的方法。

import org.apache.spark.sql.functions.{array, lit, map, struct} 

df.withColumn("new_column", lit(10)) 
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))

儘管速度較慢，但也可以使用UDF。

來源

2015-09-25 18:40:15 zero323

-1

在火花2.2有兩種方法可以在數據幀中添加的列的恆定值：

1）使用3210

2）使用typedLit。

兩者之間的區別在於typedLit也可以處理參數化的scala類型，例如表，序列和地圖

樣品數據框：

val df = spark.createDataFrame(Seq((0,"a"),(1,"b"),(2,"c"))).toDF("id", "col1") 

+---+----+ 
| id|col1| 
+---+----+ 
| 0| a| 
| 1| b| 
+---+----+

1）使用3210：添加新列命名NEWCOL常量字符串值：

import org.apache.spark.sql.functions.lit 
val newdf = df.withColumn("newcol",lit("myval"))

結果：

+---+----+------+ 
| id|col1|newcol| 
+---+----+------+ 
| 0| a| myval| 
| 1| b| myval| 
+---+----+------+

2）使用typedLit：

import org.apache.spark.sql.functions.typedLit 
df.withColumn("newcol", typedLit(("sample", 10, .044)))

結果：

+---+----+-----------------+ 
| id|col1|   newcol| 
+---+----+-----------------+ 
| 0| a|[sample,10,0.044]| 
| 1| b|[sample,10,0.044]| 
| 2| c|[sample,10,0.044]| 
+---+----+-----------------+

來源

2018-02-26 09:05:21

凡下來投它請提供解釋。 –

如何在Spark DataFrame中添加常量列？

回答

相關問題