2017-09-14

For example, I want to bin the people in a DataFrame into the following four groups based on age. In pandas I would do this with pandas.cut(). How do I do the same binning in PySpark?

import numpy as np

age_bins = [ 0, 6, 18, 60, np.inf ]
age_labels = [ 'infant', 'minor', 'adult', 'senior' ]
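For reference, the pandas version alluded to would look roughly like this (a sketch, assuming the ages live in a column named `age`):

```python
import numpy as np
import pandas as pd

age_bins = [0, 6, 18, 60, np.inf]
age_labels = ['infant', 'minor', 'adult', 'senior']

df = pd.DataFrame({'age': [23, 2, 61]})
# pd.cut uses (lower, upper] intervals by default
df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels)
print(df['age_group'].tolist())  # → ['adult', 'infant', 'senior']
```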

Answer


You can use the Bucketizer transformer from Spark's ml library.

from pyspark.ml.feature import Bucketizer

values = [("a", 23), ("b", 45), ("c", 10), ("d", 60), ("e", 56),
          ("f", 2), ("g", 25), ("h", 40), ("j", 33)]
df = spark.createDataFrame(values, ["name", "ages"])

# Each bucket is [lower, upper); handleInvalid="keep" puts NaN values
# into an extra bucket instead of raising an error
bucketizer = Bucketizer(splits=[0, 6, 18, 60, float('inf')],
                        inputCol="ages", outputCol="buckets")
df_buck = bucketizer.setHandleInvalid("keep").transform(df)

df_buck.show()

Output

+----+----+-------+ 
|name|ages|buckets| 
+----+----+-------+ 
| a| 23| 2.0| 
| b| 45| 2.0| 
| c| 10| 1.0| 
| d| 60| 3.0| 
| e| 56| 2.0| 
| f| 2| 0.0| 
| g| 25| 2.0| 
| h| 40| 2.0| 
| j| 33| 2.0| 
+----+----+-------+ 
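Note that Bucketizer's splits define half-open intervals [lower, upper), except the last bucket, which also includes its upper bound. The bucket index it assigns can be sketched in plain Python with `bisect` (an illustration of the semantics, not Spark's actual implementation):

```python
from bisect import bisect_right

splits = [0, 6, 18, 60, float('inf')]

def bucket(age):
    # Buckets are half-open [lower, upper): an age of exactly 6
    # lands in bucket 1, an age of exactly 60 in bucket 3
    return bisect_right(splits, age) - 1

print([bucket(a) for a in (23, 10, 60, 2)])  # → [2, 1, 3, 0]
```

These indices match the `buckets` column in the output above.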

If you want a label for each bucket, you can use a UDF that maps bucket indices to bucket names:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Map each bucket index to its label
t = {0.0: "infant", 1.0: "minor", 2.0: "adult", 3.0: "senior"}
udf_foo = udf(lambda x: t[x], StringType())
df_buck.withColumn("age_bucket", udf_foo("buckets")).show()

Output

+----+----+-------+----------+ 
|name|ages|buckets|age_bucket| 
+----+----+-------+----------+ 
| a| 23| 2.0|  adult| 
| b| 45| 2.0|  adult| 
| c| 10| 1.0|  minor| 
| d| 60| 3.0| senior| 
| e| 56| 2.0|  adult| 
| f| 2| 0.0| infant| 
| g| 25| 2.0|  adult| 
| h| 40| 2.0|  adult| 
| j| 33| 2.0|  adult| 
+----+----+-------+----------+ 

Awesome way to create the new column, thanks! –