Converting time variables to a binary response

I currently have an RDD with two columns that are time variables:

Row(pickup_time=datetime.datetime(2014, 2, 9, 14, 51),
    dropoff_time=datetime.datetime(2014, 2, 9, 14, 58))

I want to convert these into a binary response variable, where 1 indicates night-time and 0 indicates daytime.

I know we can use UserDefinedFunction to create a function that converts them into the required format.

For example, I have another column that gives the payment type as 'CSH' or 'CRD', and I was able to handle it like this:

pay_map = {'CRD':1.0, 'CSH':0.0} 
pay_bin = UserDefinedFunction(lambda z: pay_map[z], DoubleType()) 
df = df.withColumn('payment_type', pay_bin(df['payment_type'])) 
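
One caveat with this pattern, sketched here in plain Python: `pay_map[z]` raises `KeyError` for any value that is not in the map, including `None`. The name `pay_to_double` is hypothetical and only for illustration. With `dict.get` the lookup returns `None` instead, which a Python UDF would pass back to Spark as null.

```python
pay_map = {'CRD': 1.0, 'CSH': 0.0}


def pay_to_double(z):
    # Hypothetical helper: returns None for unmapped or missing payment
    # types instead of raising KeyError the way pay_map[z] would.
    return pay_map.get(z)


print(pay_to_double('CRD'))  # 1.0
print(pay_to_double('UNK'))  # None
```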

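How would I apply the same logic to the time columns I'm asking about? In case it helps, I'm converting these variables because I'm going to run a decision tree.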

Answer

There is no need for a UDF here. You can use between and a type cast:

from pyspark.sql.functions import hour


def in_range(colname, lower_bound=6, upper_bound=17):
    """
    Return an integer column: 1 if the hour of colname is in the range, else 0.

    :param colname: input column name (str)
    :param lower_bound: first hour of the range (int, 0-23)
    :param upper_bound: last hour of the range, inclusive (int, 0-23)
    """
    assert 0 <= lower_bound <= 23
    assert 0 <= upper_bound <= 23

    if lower_bound <= upper_bound:
        # Ordinary range within a single day, e.g. 6..17
        return hour(colname).between(lower_bound, upper_bound).cast("integer")
    else:
        # Range wraps past midnight, e.g. 18..5
        return (
            (hour(colname) >= lower_bound) |
            (hour(colname) <= upper_bound)
        ).cast("integer")
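
The wrap-around branch is easier to check without Spark. Below is a minimal plain-Python sketch of the same hour test on ordinary `datetime` values. `hour_in_range` is a hypothetical helper name, not part of PySpark, and treating equal bounds as a one-hour range is an assumption of this sketch.

```python
import datetime


def hour_in_range(ts, lower_bound=6, upper_bound=17):
    """Return 1 if ts.hour lies in the inclusive hour range, else 0.

    Hypothetical helper for illustration only.
    """
    assert 0 <= lower_bound <= 23
    assert 0 <= upper_bound <= 23

    h = ts.hour
    if lower_bound <= upper_bound:
        # Ordinary range, e.g. 6..17 for daytime.
        return int(lower_bound <= h <= upper_bound)
    # Range wraps past midnight, e.g. 18..5 for night.
    return int(h >= lower_bound or h <= upper_bound)


pickup = datetime.datetime(2014, 2, 9, 19, 51)
dropoff = datetime.datetime(2014, 2, 9, 1, 58)

print(hour_in_range(dropoff))        # 0: 01:58 is outside 6..17
print(hour_in_range(pickup, 18, 5))  # 1: 19:51 is inside 18..5
```

Both bounds are inclusive at the hour level, so a night range of 18 to 5 covers 18:00 through 05:59.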

Example usage:

import datetime

from pyspark.sql import Row

# Assumes an existing SparkContext named sc
df = sc.parallelize([
    Row(
        pickup_time=datetime.datetime(2014, 2, 9, 14, 51),
        dropoff_time=datetime.datetime(2014, 2, 9, 14, 58)
    ),
    Row(
        pickup_time=datetime.datetime(2014, 2, 9, 19, 51),
        dropoff_time=datetime.datetime(2014, 2, 9, 1, 58)
    )
]).toDF()

(df
    .withColumn("dropoff_during_day", in_range("dropoff_time"))
    # night: hour >= 18 or hour <= 5 (6pm through 5:59am)
    .withColumn("pickpup_during_night", in_range("pickup_time", 18, 5))
    .show())
+--------------------+--------------------+------------------+--------------------+
|        dropoff_time|         pickup_time|dropoff_during_day|pickpup_during_night|
+--------------------+--------------------+------------------+--------------------+
|2014-02-09 14:58:...|2014-02-09 14:51:...|                 1|                   0|
|2014-02-09 01:58:...|2014-02-09 19:51:...|                 0|                   1|
+--------------------+--------------------+------------------+--------------------+