2015-05-27

Filter out duplicate rows based on a subset of columns

I have some data that looks like this:

ID,DateTime,Category,SubCategory 
X01,2014-02-13T12:36:14,Clothes,Tshirts 
X01,2014-02-13T12:37:16,Clothes,Tshirts 
X01,2014-02-13T12:38:33,Shoes,Running 
X02,2014-02-13T12:39:23,Shoes,Running 
X02,2014-02-13T12:40:42,Books,Fiction 
X02,2014-02-13T12:41:04,Books,Fiction 

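What I would like to do is keep only one instance of each data point with respect to time, like this (I don't care which instance in time is kept):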

ID,DateTime,Category,SubCategory 
X01,2014-02-13T12:36:14,Clothes,Tshirts 
X02,2014-02-13T12:39:23,Shoes,Running 
X02,2014-02-13T12:40:42,Books,Fiction 

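Unfortunately, according to the Hive Language Manual, Hive's DISTINCT works on the whole row of selected columns rather than on a subset of them, so something like this is not an option: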

SELECT DISTINCT(ID, SubCategory), 
     DateTime, 
     Category 
FROM sometable 

How can I get the second table above? Thanks in advance!

Answer

The usual approach for this kind of thing in SQL is a GROUP BY:

select ID, category, subcategory, min(datetime) datetime 
from sometable 
group by ID, category, subcategory
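
To try the query on the sample rows from the question without a Hive cluster, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. This rests on an assumption: SQLite is not Hive, but GROUP BY with MIN should behave the same way for this particular query. The names `ROWS`, `QUERY` and `dedupe` are made up for the illustration, and the ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# Sample rows copied from the question.
ROWS = [
    ("X01", "2014-02-13T12:36:14", "Clothes", "Tshirts"),
    ("X01", "2014-02-13T12:37:16", "Clothes", "Tshirts"),
    ("X01", "2014-02-13T12:38:33", "Shoes", "Running"),
    ("X02", "2014-02-13T12:39:23", "Shoes", "Running"),
    ("X02", "2014-02-13T12:40:42", "Books", "Fiction"),
    ("X02", "2014-02-13T12:41:04", "Books", "Fiction"),
]

# Same GROUP BY as above; ORDER BY is an addition for stable output.
QUERY = """
SELECT ID, MIN(DateTime) AS DateTime, Category, SubCategory
FROM sometable
GROUP BY ID, Category, SubCategory
ORDER BY MIN(DateTime)
"""


def dedupe(rows):
    """Load rows into an in-memory SQLite table and keep the earliest
    row for each (ID, Category, SubCategory) combination."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE sometable "
            "(ID TEXT, DateTime TEXT, Category TEXT, SubCategory TEXT)"
        )
        conn.executemany("INSERT INTO sometable VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in dedupe(ROWS):
        print(",".join(row))
```

On the sample data this yields four rows, one per distinct (ID, Category, SubCategory) combination, so X01,Shoes,Running is kept as well. If MIN is swapped for MAX, the latest timestamp in each group is kept instead, which is fine when it does not matter which instance survives.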