Comparison operator in PySpark (not equal / !=)

I am trying to obtain all rows in a dataframe where two flags are set to '1', and subsequently all rows where only one of the two is set to '1' and the other is NOT EQUAL to '1'.

With the following schema (three columns),

df = sqlContext.createDataFrame([('a',1,'null'),('b',1,1),('c',1,'null'),('d','null',1),('e',1,1)], #,('f',1,'NaN'),('g','bla',1)], 
          schema=('id', 'foo', 'bar') 
          )

I obtain the following dataframe:

+---+----+----+ 
| id| foo| bar| 
+---+----+----+ 
| a| 1|null| 
| b| 1| 1| 
| c| 1|null| 
| d|null| 1| 
| e| 1| 1| 
+---+----+----+

When I apply the desired filters, the first filter (foo=1 AND bar=1) works, but the other one (foo=1 AND NOT bar=1) does not.

foobar_df = df.filter((df.foo==1) & (df.bar==1))

yields:

+---+---+---+ 
| id|foo|bar| 
+---+---+---+ 
| b| 1| 1| 
| e| 1| 1| 
+---+---+---+

Here is the filter that does not behave as expected:

foo_df = df.filter((df.foo==1) & (df.bar!=1)) 
foo_df.show() 
+---+---+---+ 
| id|foo|bar| 
+---+---+---+ 
+---+---+---+

Why is it not filtering? How can I get the rows where only foo is equal to '1'?

Source

2016-08-24 Hendrik F

To filter on null values, try:

foo_df = df.filter((df.foo==1) & (df.bar.isNull()))

https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull

Source

2016-08-24 10:39:13 johnaphun

Why is it not filtering?

Because it is SQL, and NULL indicates a missing value. Any comparison with NULL, other than IS NULL and IS NOT NULL, is therefore undefined. You need either:

col("bar").isNull() | (col("bar") != 1)

or

coalesce(col("bar") != 1, lit(True))

or (PySpark >= 2.3):

col("bar").eqNullSafe(1)

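if you want null-safe comparisons in PySpark. Note that eqNullSafe tests equality, so a null-safe "not equal" is its negation, ~col("bar").eqNullSafe(1).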
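To see why the original filter returns nothing, it may help to model SQL's three-valued logic in plain Python. This is only a sketch: None stands in for NULL, and sql_eq, sql_ne, sql_and and where are hypothetical helpers written for illustration, not PySpark functions. The data mirrors the question's dataframe, with real missing values instead of the string 'null'.

```python
# Pure-Python model of SQL three-valued logic; None stands in for NULL.
# All helpers here are hypothetical illustrations, not part of PySpark.

def sql_eq(a, b):
    """a = b under SQL rules: unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def sql_ne(a, b):
    """a != b under SQL rules: unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b

def sql_and(p, q):
    """SQL AND: False wins, then unknown (None), otherwise True."""
    if p is False or q is False:
        return False
    if p is None or q is None:
        return None
    return True

def where(rows, predicate):
    """Keep a row only when the predicate is exactly True, as a filter does."""
    return [row for row in rows if predicate(row) is True]

rows = [
    ("a", 1, None), ("b", 1, 1), ("c", 1, None), ("d", None, 1), ("e", 1, 1),
]

# Model of (df.foo == 1) & (df.bar != 1): for rows a and c the second
# operand is unknown, so the whole predicate is unknown and the row is dropped.
naive = where(rows, lambda r: sql_and(sql_eq(r[1], 1), sql_ne(r[2], 1)))

# Model of (foo == 1) & (bar.isNull() | (bar != 1)). A plain Python `or` is
# enough here: when bar is not NULL, sql_ne returns an ordinary bool.
fixed = where(
    rows,
    lambda r: sql_and(sql_eq(r[1], 1), r[2] is None or sql_ne(r[2], 1)),
)

print(naive)  # []
print(fixed)  # [('a', 1, None), ('c', 1, None)]
```

The key point in this model is that a filter keeps only rows whose predicate is True; an unknown result is treated the same as False.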

Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate missing objects.
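If the input rows already contain placeholder strings such as 'null', one option is to convert them to None before building the dataframe. The following is a minimal sketch in plain Python; normalize_nulls and the set of marker spellings are assumptions made for illustration and are not part of PySpark.

```python
# normalize_nulls is a hypothetical helper, not a PySpark function.
# The marker spellings below are an assumption; adjust them to your data.
NULL_MARKERS = {"null", "NULL", ""}

def normalize_nulls(rows, markers=NULL_MARKERS):
    """Return rows with placeholder strings replaced by None."""
    return [
        tuple(None if isinstance(v, str) and v in markers else v for v in row)
        for row in rows
    ]

raw = [("a", 1, "null"), ("b", 1, 1), ("d", "null", 1)]
clean = normalize_nulls(raw)

print(clean)  # [('a', 1, None), ('b', 1, 1), ('d', None, 1)]
```

The cleaned list can then be passed to createDataFrame in place of the raw one, so that the missing entries are stored as real NULL values.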

from pyspark.sql.functions import col, coalesce, lit 

df = spark.createDataFrame([ 
    ('a', 1, 1), ('a',1, None), ('b', 1, 1), 
    ('c' ,1, None), ('d', None, 1),('e', 1, 1) 
]).toDF('id', 'foo', 'bar') 

df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show() 

## +---+---+----+ 
## | id|foo| bar| 
## +---+---+----+ 
## | a| 1|null| 
## | c| 1|null| 
## +---+---+----+ 

df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show() 

## +---+---+----+ 
## | id|foo| bar| 
## +---+---+----+ 
## | a| 1|null| 
## | c| 1|null| 
## +---+---+----+
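As a sanity check on the three alternatives listed above, the sketch below models each of them in plain Python and compares the results for a NULL value, a value equal to 1, and a value different from 1. None stands in for NULL; sql_ne, coalesce and eq_null_safe are hypothetical stand-ins that only mimic the behaviour of the corresponding PySpark expressions.

```python
# Pure-Python stand-ins (hypothetical, for illustration only); None is NULL.

def sql_ne(a, b):
    """a != b under SQL rules: unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b

def coalesce(*values):
    """Return the first value that is not None, or None if all are."""
    for v in values:
        if v is not None:
            return v
    return None

def eq_null_safe(a, b):
    """Null-safe equality: NULL equals NULL, and NULL differs from any value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

results = {}
for bar in (None, 1, 2):
    via_is_null = bar is None or sql_ne(bar, 1)    # isNull() | (bar != 1)
    via_coalesce = coalesce(sql_ne(bar, 1), True)  # coalesce(bar != 1, lit(True))
    via_null_safe = not eq_null_safe(bar, 1)       # ~bar.eqNullSafe(1)
    results[bar] = (via_is_null, via_coalesce, via_null_safe)

print(results)
# {None: (True, True, True), 1: (False, False, False), 2: (True, True, True)}
```

In this model all three forms agree for every input, which suggests the choice between them is mostly a matter of readability and of the PySpark version available.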

Source

2016-08-24 11:07:31 zero323
