2017-06-21

How do I join two DataFrames on multiple primary keys and update the values, including rows that are missing from one side?

Old DataFrame:

## +---+---+----+----+
## |pk1|pk2|val1|val2|
## +---+---+----+----+
## |  1| aa|  ab|  ac|
## |  2| bb|  bc|  bd|
## +---+---+----+----+

New DataFrame:

## +---+---+----+----+
## |pk1|pk2|val1|val2|
## +---+---+----+----+
## |  1| aa|  ab|  ad|
## |  2| bb|  bb|  bd|
## |  3| cc|  cc|  cc|
## +---+---+----+----+

Result:

## +---+---+----+----+
## |pk1|pk2|val1|val2|
## +---+---+----+----+
## |  1| aa|  ab|  ad|
## |  2| bb|  bb|  bd|
## |  3| cc|  cc|  cc|
## +---+---+----+----+

Would an outer join on multiple keys work for this?

How is the result dataset different from the new DataFrame?

Answer

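From your sample data, I assume that elements in the new DataFrame should be picked over those in the old DataFrame whenever the two differ.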
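For the plain "new wins over old" result in the question, a full outer join on both keys would probably do. Here is a minimal sketch, assuming that a non-null new value should win and that the old value is kept otherwise. The helper name `mergeNewOverOld` and the aliases `n` and `o` are my own, not something from the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col}

object UpsertSketch {
  // Hypothetical helper: full outer join on the key columns, then take the
  // new value where it is non-null and fall back to the old value otherwise.
  def mergeNewOverOld(dfNew: DataFrame, dfOld: DataFrame, keys: Seq[String]): DataFrame = {
    // Every column that is not a key is treated as a val-column
    val valColumns = dfNew.columns.filterNot(c => keys.contains(c))
    val joined = dfNew.as("n").join(dfOld.as("o"), keys, "full_outer")
    val merged = valColumns.map(c => coalesce(col(s"n.$c"), col(s"o.$c")).as(c))
    joined.select(keys.map(k => col(k)) ++ merged: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists
    val spark = SparkSession.builder().master("local[*]").appName("upsert-sketch").getOrCreate()
    import spark.implicits._

    val dfOld = Seq(
      (1, "aa", "ab", "ac"),
      (2, "bb", "bc", "bd")
    ).toDF("pk1", "pk2", "val1", "val2")

    val dfNew = Seq(
      (1, "aa", "ab", "ad"),
      (2, "bb", "bb", "bd"),
      (3, "cc", "cc", "cc")
    ).toDF("pk1", "pk2", "val1", "val2")

    mergeNewOverOld(dfNew, dfOld, Seq("pk1", "pk2")).orderBy("pk1", "pk2").show()
    spark.stop()
  }
}
```

With your sample data this should give back the three rows of your Result table. Unlike a left outer join from the new DataFrame, it would also keep a row that exists only in the old DataFrame, so use it only if that is what you want.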

[UPDATE] Since the val-columns are dynamic, you can apply foldLeft to the list of column names as follows:

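// Assumes a SparkSession named `spark` is in scope, as in spark-shell
import org.apache.spark.sql.functions.when
import spark.implicits._

val dfOld = Seq(
  (1, "aa", "ab", "ac"),
  (2, "bb", "bc", "bd")
).toDF("pk1", "pk2", "val1", "val2")

val dfNew = Seq(
  (1, "aa", "ab", "ad"),
  (2, "bb", "bb", "bd"),
  (3, "cc", "cc", "cc")
).toDF("pk1", "pk2", "val1", "val2")

// Assemble the list of val-columns (every column that is not a key)
val valColumns = dfNew.columns.filter(x => x != "pk1" && x != "pk2")

// Keep every row of dfNew; rows with no match in dfOld get nulls on the old side
val dfJoined = dfNew.join(dfOld, Seq("pk1", "pk2"), "left_outer")

// For each val-column, add a diff-column that holds the new value when it
// differs from the old one (or when the old value is null, e.g. no matching
// old row) and null otherwise, then drop both copies of the original column
val dfDiff = valColumns.foldLeft(dfJoined)((acc, x) =>
  acc
    .withColumn(
      x + "diff",
      when(dfNew(x) =!= dfOld(x) || dfOld(x).isNull, dfNew(x)).otherwise(null)
    )
    .drop(x)
)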

dfDiff.show
+---+---+--------+--------+
|pk1|pk2|val1diff|val2diff|
+---+---+--------+--------+
|  1| aa|    null|      ad|
|  2| bb|      bb|    null|
|  3| cc|      cc|      cc|
+---+---+--------+--------+

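Thanks, it works. One more thing, though: I also want to get only the updated columns rather than all of the columns. I didn't manage to state that clearly in the question. – ashK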


@ashK, please see the updated answer.


Thanks, that was helpful. Now I will have to work on these dynamic columns. – ashK