PySpark: Subtract DataFrames while ignoring some columns

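I want to subtract one DataFrame from another in PySpark. The challenge is that some columns must be ignored during the subtraction, yet the resulting DataFrame should still contain all columns, including the ignored ones.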

Here is an example:

from pyspark.sql import Row  # needed for the Row(...) literals below

userLeft = sc.parallelize([ 
    Row(id=u'1', 
     first_name=u'Steve', 
     last_name=u'Kent', 
     email=u'[email protected]', 
     date1=u'2017-02-08'), 
    Row(id=u'2', 
     first_name=u'Margaret', 
     last_name=u'Peace', 
     email=u'[email protected]', 
     date1=u'2017-02-09'), 
    Row(id=u'3', 
     first_name=None, 
     last_name=u'hh', 
     email=u'[email protected]', 
     date1=u'2017-02-10') 
]).toDF() 

userRight = sc.parallelize([ 
    Row(id=u'2', 
     first_name=u'Margaret', 
     last_name=u'Peace', 
     email=u'[email protected]', 
     date1=u'2017-02-11'), 
    Row(id=u'3', 
     first_name=None, 
     last_name=u'hh', 
     email=u'[email protected]', 
     date1=u'2017-02-12') 
]).toDF()

Expected:

ActiveDF = userLeft.subtract(userRight)  # desired: ignore the "date1" column while subtracting

The final result should look like this, including the "date1" column.

+----------+--------------------+----------+---+---------+ 
|  date1|    email|first_name| id|last_name| 
+----------+--------------------+----------+---+---------+ 
|2017-02-08| [email protected]|  Steve| 1|  Kent| 
+----------+--------------------+----------+---+---------+


2017-09-06 orNehPraka

It looks like you need an anti-join:

userLeft.join(userRight, ["id"], "leftanti").show() 
+----------+----------------+----------+---+---------+ 
|  date1|   email|first_name| id|last_name| 
+----------+----------------+----------+---+---------+ 
|2017-02-08|[email protected]|  Steve| 1|  Kent| 
+----------+----------------+----------+---+---------+


2017-09-06 16:16:15 Psidom

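"leftanti" is not available in PySpark 1.6. I don't have any specific primary key for these DataFrames. They are generated at runtime, so I don't know the column details in advance. But I always know which columns I do not want to consider when joining. – orNehPraka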

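One option is 'userLeft.join(userRight, [col for col in userLeft.columns if col != 'date1'], "leftanti")' if you want to join on all columns except 'date1'. Note that this is not null safe, so you may need to fill the null values with empty strings before doing this. – Psidom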

You can also use a full join and keep only the rows where one side is null:

import pyspark.sql.functions as psf

userLeft.join(
    userRight,
    [c for c in userLeft.columns if c != "date1"],
    "full"
).filter(psf.isnull(userLeft.date1) | psf.isnull(userRight.date1)).show()

    +------------------+----------+---+---------+----------+----------+ 
    |    email|first_name| id|last_name|  date1|  date1| 
    +------------------+----------+---+---------+----------+----------+ 
    |[email protected]|  null| 3|  hh|2017-02-10|  null| 
    |[email protected]|  null| 3|  hh|  null|2017-02-12| 
    | [email protected]|  Steve| 1|  Kent|2017-02-08|  null| 
    +------------------+----------+---+---------+----------+----------+

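If you want to use a join, whether leftanti or full, you will need to pick a default value for the nulls in your join columns (I think we discussed this in a previous thread).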

You can also simply drop the columns that are getting in the way, then subtract and join:

df = userLeft.drop("date1").subtract(userRight.drop("date1")) 
userLeft.join(df, df.columns).show() 

    +----------------+----------+---+---------+----------+ 
    |   email|first_name| id|last_name|  date1| 
    +----------------+----------+---+---------+----------+ 
    |[email protected]|  Steve| 1|  Kent|2017-02-08| 
    +----------------+----------+---+---------+----------+


2017-09-06 18:27:54 MaFF

This is production data. I cannot touch the NULLs and assign them a default value. – orNehPraka
