2016-04-09

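I have two DataFrames that represent the same people at two different points in time. For each row, I would like to know whether any of 5 (fixed) columns changed between the two DataFrames. How can I check for differences between rows that belong to two DataFrames?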

Before:

+--+------+------+------+------+------+------+ 
|id| sport| var1| var2| var3| var4| var5| 
+--+------+------+------+------+------+------+ 
| 1|soccer|330234|  |  |  |  | 
| 2|soccer| null| null| null| null| null| 
| 3|soccer|330101|  |  |  |  | 
| 4|soccer| null| null| null| null| null| 
| 5|soccer| null| null| null| null| null| 
| 6|soccer| null| null| null| null| null| 
| 7|soccer| null| null| null| null| null| 
| 8|soccer|330024|330401|  |  |  | 
| 9|soccer|330055|330106|  |  |  | 
|10|soccer| null| null| null| null| null| 
|11|soccer|390027|  |  |  |  | 
|12|soccer| null| null| null| null| null| 
|13|soccer|330101|  |  |  |  | 
|14|soccer|330059|  |  |  |  | 
|15|soccer| null| null| null| null| null| 
|16|soccer|140242|140281|  |  |  | 
|17|soccer|330214|  |  |  |  | 
|18|soccer|  |  |  |  |  | 
|19|soccer|330055|330196|  |  |  | 
|20|soccer|210022|  |  |  |  | 
+--+------+------+------+------+------+------+ 

After:

+--+------+------+------+------+------+------+ 
|id| sport| var1| var2| var3| var4| var5| 
+--+------+------+------+------+------+------+ 
| 1|soccer|330234|  |  |  |  | 
| 2|soccer| null| null| null| null| null| 
| 3|soccer|330101|  |  |  |  | 
| 4|soccer| null| null| null| null| null| 
| 5|soccer| null| null| null| null| null| 
| 6|soccer| null| null| null| null| null| 
| 7|soccer| null| null| null| null| null| 
| 8|soccer| null| null| null| null| null| 
| 9|soccer|330106|  |  |  |  | 
|10|soccer| null| null| null| null| null| 
|11|soccer|390027|  |  |  |  | 
|12|soccer| null| null| null| null| null| 
|13|soccer| null| null| null| null| null| 
|14|soccer|330128|330331|330106|330059|  | 
|15|soccer| null| null| null| null| null| 
|16|soccer|140242|140281|140010|  |  | 
|17|soccer|330214|  |  |  |  | 
|18|soccer| null| null| null| null| null| 
|19|soccer|330196|  |  |  |  | 
|20|soccer|210022|  |  |  |  | 
+--+------+------+------+------+------+------+ 

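I know how to scan for differences across the columns of a single row, but I have no idea how to compare rows from two different DataFrames.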

An ideal output would be:

+--+------+------+ 
|id| sport| diff| 
+--+------+------+ 
| 1|soccer|  0| 
| 2|soccer|  0| 
| 3|soccer|  0| 
| 4|soccer|  0| 
| 5|soccer|  0| 
| 6|soccer|  0| 
| 7|soccer|  0| 
| 8|soccer|  1| 
| 9|soccer|  1| 
|10|soccer|  0| 
|11|soccer|  0| 
|12|soccer|  0| 
|13|soccer|  1| 
|14|soccer|  1| 
|15|soccer|  0| 
|16|soccer|  1| 
|17|soccer|  0| 
|18|soccer|  0| 
|19|soccer|  1| 
|20|soccer|  0| 
+--+------+------+ 

Answer

Do you mean something like this? Let's start with some example data:

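// Outside spark-shell, toDF and $"..." also need the session's implicits
// (import spark.implicits._, or sqlContext.implicits._ on Spark 1.x).
import org.apache.spark.sql.functions.{col, lit, not}

val before = Seq(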
    (1, "soccer", Some(1), Some(2), Some(3), Some(4), None), 
    (2, "soccer", None, Some(0), None, None, Some(0)), 
    (3, "soccer", None, None, None, None, None) 
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5") 

val after = Seq(
    (1, "soccer", Some(1), Some(2), Some(3), Some(4), None), // Zero diffs 
    (2, "soccer", Some(1), Some(0), None, None, Some(0)), // One diff 
    (3, "soccer", Some(1), Some(1), Some(1), Some(1), Some(1)) // Five diffs 
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5") 

Generate the expressions that count the differences:

// Extract var columns 
val varCols = before.columns.drop(2) 

// Generate a list of exprs 
// CAST(NOT(before.var1 <=> after.var1) AS INT) 
val equalsExprs = varCols.map(
    c => not(col(s"before.$c") <=> col(s"after.$c")).cast("int").alias(s"${c}_ne")) 

// SUM 
val diff = equalsExprs.foldLeft(lit(0))(_ + _).alias("diff") 

It will treat:

  • two NULLs as equal
  • any value and NULL as not equal
  • two non-null values according to the standard equality of their type
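The three rules above can be modeled in plain Scala without Spark. This is only an illustrative sketch: `None` stands in for NULL, and `nullSafeEq` and `countDiffs` are made-up names, not Spark APIs. It also shows why `<=>` is used here: an ordinary SQL comparison against NULL yields NULL, which would turn the whole sum into NULL.

```scala
// Plain-Scala model of null-safe equality (<=>); None stands in for NULL.
// nullSafeEq and countDiffs are illustrative names, not part of Spark.
def nullSafeEq[A](a: Option[A], b: Option[A]): Boolean = (a, b) match {
  case (None, None)       => true    // two NULLs are equal
  case (Some(x), Some(y)) => x == y  // ordinary equality for non-null values
  case _                  => false   // a value and NULL are not equal
}

// Count the positions at which two rows differ, as the diff column does.
def countDiffs[A](before: Seq[Option[A]], after: Seq[Option[A]]): Int =
  before.zip(after).count { case (b, a) => !nullSafeEq(b, a) }

// The var columns of id 2 from the example data.
val row2Before = Seq(None, Some(0), None, None, Some(0))
val row2After  = Seq(Some(1), Some(0), None, None, Some(0))

println(countDiffs(row2Before, row2After))
```

For id 2 only `var1` goes from NULL to a value, so the count is 1.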

Join and select the expression:

val diffs = before.as("before").join(after.as("after"), Seq("id", "sport")) 
    .select($"id", $"sport", diff) 

diffs.show 

// +---+------+----+ 
// | id| sport|diff| 
// +---+------+----+ 
// | 1|soccer| 0| 
// | 2|soccer| 1| 
// | 3|soccer| 5| 
// +---+------+----+ 
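The question asks for a 0/1 flag rather than a count. A minimal sketch of one way to get that, assuming Spark 2.x or later; the names `session`, `frame`, `anyChanged` and `flags` are introduced here for illustration, and the data is a cut-down version of the example with two var columns:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, not}

// Assumption: Spark 2.x+; getOrCreate reuses an existing session if there is one.
val session = SparkSession.builder().master("local[1]").appName("diff-flag").getOrCreate()

val names = Seq("id", "sport", "var1", "var2")
val varNames = names.drop(2)

// Illustrative helper: the explicit tuple type keeps every var column Option[Int].
def frame(rows: Seq[(Int, String, Option[Int], Option[Int])]) =
  session.createDataFrame(rows).toDF(names: _*)

val beforeDf = frame(Seq(
  (1, "soccer", Some(1), Some(2)),
  (2, "soccer", None, Some(0)),
  (3, "soccer", None, None)))

val afterDf = frame(Seq(
  (1, "soccer", Some(1), Some(2)),   // no change
  (2, "soccer", Some(1), Some(0)),   // one column changed
  (3, "soccer", Some(1), Some(1))))  // two columns changed

// true when at least one var column differs under null-safe comparison
val anyChanged = varNames
  .map(c => not(col(s"before.$c") <=> col(s"after.$c")))
  .reduce(_ || _)

val flags = beforeDf.as("before")
  .join(afterDf.as("after"), Seq("id", "sport"))
  .select(col("id"), col("sport"), anyChanged.cast("int").alias("diff"))

flags.orderBy("id").show()
```

Because `<=>` never returns NULL, the OR-ed condition is always true or false, so the cast gives 0 or 1 for every row. With this sample data the flag should be 0 for id 1 and 1 for ids 2 and 3.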

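I was wondering whether it is possible to write an expression that not only counts the differences, but also tells whether they add to or subtract from the current state. Say `Some(1), Some(2), None, None, None` becomes `Some(1), Some(2), Some(3), Some(4), None` or `None, None, None, None, None`: both have changed, but in the first case it is +2, while in the second case it is -2 – user299791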
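A rough sketch of one possible approach to this follow-up, not part of the original answer. It assumes Spark 2.x or later and reads "+2 / -2" as the net change in the number of non-null var columns; `session`, `frame`, `present`, `net` and `signed` are illustrative names. Note that under this reading a value replaced by a different value contributes 0.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: Spark 2.x+; getOrCreate reuses an existing session if there is one.
val session = SparkSession.builder().master("local[1]").appName("signed-diff").getOrCreate()

val names = Seq("id", "var1", "var2", "var3", "var4", "var5")
val varNames = names.drop(1)

// Illustrative helper: the explicit tuple type keeps every var column Option[Int].
def frame(rows: Seq[(Int, Option[Int], Option[Int], Option[Int], Option[Int], Option[Int])]) =
  session.createDataFrame(rows).toDF(names: _*)

// Both ids start from the state described in the comment.
val beforeDf = frame(Seq(
  (1, Some(1), Some(2), None, None, None),
  (2, Some(1), Some(2), None, None, None)))

val afterDf = frame(Seq(
  (1, Some(1), Some(2), Some(3), Some(4), None),  // two values added
  (2, None, None, None, None, None)))             // two values removed

// 1 when the column is non-null on the given side of the join, else 0
def present(side: String, c: String): Column =
  col(s"$side.$c").isNotNull.cast("int")

// Net change in the number of populated var columns; positive means values were added.
val net = varNames
  .map(c => present("after", c) - present("before", c))
  .reduce(_ + _)
  .alias("net")

val signed = beforeDf.as("before")
  .join(afterDf.as("after"), Seq("id"))
  .select(col("id"), net)

signed.orderBy("id").show()
```

For the two cases in the comment this should yield +2 for id 1 and -2 for id 2. If blank cells are stored as empty strings rather than NULL, `isNotNull` would count them as populated, so they would need to be normalized first.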