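I have the following input, where each row records that a user watched a show up to a given percentage (25, 50, 75, or 100). I just want the highest percentage each user reached for each ID. The input and desired output are below. How can I collapse the rows to the maximum value per group in Hive?
Input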
id1, u1, watched25
id2, u1, watched25
id1, u1, watched50
id1, u1, watched75
id3, u1, watched25
id4, u1, watched25
id1, u1, watched100
id2, u1, watched50
id5, u1, watched25
id5, u1, watched50
id5, u1, watched75
id5, u1, watched100
id1, u2, watched25
id1, u2, watched50
id3, u2, watched25
id3, u3, watched25
id1, u2, watched75
id4, u3, watched25
id4, u3, watched50
Desired output
id1, u1, watched100
id2, u1, watched50
id3, u1, watched25
id4, u1, watched25
id5, u1, watched100
id1, u2, watched75
id3, u2, watched25
id3, u3, watched25
id4, u3, watched50
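First of all, I would strip the prefix "watched" from the third column: numeric values are more efficient to store and more practical when you compare values – larsen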
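Following that suggestion, one way to get the highest percentage per (id, user) pair is to convert the third column to a number, take `max` within each group, and then add the prefix back. This is only a sketch: the table name `watch_events` and the column names `id`, `user_id`, and `status` are assumptions, since the question does not give a schema. Comparing the raw strings would not work, because `'watched75'` sorts after `'watched100'` lexicographically.

```sql
-- Hypothetical schema: watch_events(id STRING, user_id STRING, status STRING)
-- where status holds values such as 'watched25' or 'watched100'.
SELECT
  id,
  user_id,
  -- Re-attach the prefix so the result matches the input format.
  concat('watched', CAST(max_pct AS STRING)) AS status
FROM (
  SELECT
    id,
    user_id,
    -- Strip the 'watched' prefix and compare as integers, not strings.
    max(CAST(regexp_replace(trim(status), '^watched', '') AS INT)) AS max_pct
  FROM watch_events
  GROUP BY id, user_id
) t;
```

If the column were stored as an integer in the first place, the inner query alone would be enough and the `regexp_replace` and `concat` steps could be dropped.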