2016-03-29 72 views

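Generating a diff of List[String] in Scalding

My Scalding job has a records: TypedPipe[(String, util.List[String])], where the first value is an id and the second value is a list of things. For a given id, I want to output only the records that differ from one another. Imagine the following input: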

("1", ["a","b","c"]) 
("1", ["a","b","c"]) 
("1", ["a","b","c"]) 
("2", ["a","b"]) 
("2", ["a","b","c"]) 
("3", ["a","b","c"]) 

This should happen after records.groupBy(_._1). For the input above, the output should be:

("2", ["a","b"]) 
("2", ["a","b","c"]) 

I am new to Scalding. What is an elegant way to achieve this?

Answers


I don't know whether the Scalding aspect is key for you (is your collection really huge?), but in plain old Scala I would do this:

// Given:
val records = Seq(
  "1" -> List("a", "b", "c"), "1" -> List("a", "b", "c"), "1" -> List("a", "b", "c"),
  "2" -> List("a", "b"), "2" -> List("a", "b", "c"),
  "3" -> List("a", "b", "c"), "3" -> List("d"))

// Group by id and collapse each group to its set of distinct records.
val distinctValues = records.groupBy(_._1).map { case (k, v) => k -> v.toSet }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 1 -> Set((1,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

// Keep only the ids that have more than one distinct record (filter, not map).
val havingMultipleDistinct = distinctValues.filter { case (_, v) => v.size > 1 }
// => Map(2 -> Set((2,List(a, b)), (2,List(a, b, c))), 3 -> Set((3,List(a, b, c)), (3,List(d))))

// Flatten the remaining sets back into individual records (ordering may vary).
val asRecords = havingMultipleDistinct.values.flatten
// => List((2,List(a, b)), (2,List(a, b, c)), (3,List(a, b, c)), (3,List(d)))

Yes, it has to run on a cluster. Scalding is essential. – Gevorg


If the values for each key are small enough to fit in memory, then something like this should do it:

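// Method names below assume the KeyedListLike API of Grouped; check them against your Scalding version.
records
    .group
    .toSet                    // the distinct lists seen for each id
    .filter(_._2.size > 1)    // filter receives (id, set) pairs; keep ids with more than one distinct list
    .flattenValues            // back to one (id, list) record per distinct list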

If it is too large for that, you can join the pipe with itself:

val grouped = records.group
grouped
    .join(grouped)
    .collect { case (k, (a, b)) if a != b => k -> a }
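
To see what the self-join produces, here is a minimal local sketch using plain Scala collections rather than Scalding. The object name SelfJoinSketch and the helper differing are hypothetical names introduced for illustration, and the lists are Scala Lists instead of util.List. It also adds a distinct step, which is an assumption on my part and not part of the answer above: when an id has several records, the same record is emitted once for every partner it differs from.

```scala
// Local simulation of the self-join with plain Scala collections (not the Scalding API).
object SelfJoinSketch {
  type Record = (String, List[String])

  // The sample input from the question.
  val records: Seq[Record] = Seq(
    "1" -> List("a", "b", "c"),
    "1" -> List("a", "b", "c"),
    "1" -> List("a", "b", "c"),
    "2" -> List("a", "b"),
    "2" -> List("a", "b", "c"),
    "3" -> List("a", "b", "c")
  )

  // Pair every record with every record sharing its id (an inner self-join),
  // keep the left side of each pair whose two lists differ, then drop the
  // duplicates that arise when a record differs from several partners.
  def differing(rs: Seq[Record]): Seq[Record] = {
    val joined = for {
      (k1, a) <- rs
      (k2, b) <- rs
      if k1 == k2
    } yield (k1, (a, b))
    joined.collect { case (k, (a, b)) if a != b => k -> a }.distinct
  }

  def main(args: Array[String]): Unit =
    differing(records).foreach(println)
}
```

On the sample input this keeps only the two records for id "2". Note that the number of joined pairs grows quadratically with the number of records per id, so the self-join trades memory for extra shuffle and compute.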