2017-10-16 21 views
2

I have an Apache Spark DataFrame with the data below (ID, Name, DATE), and I want the top two latest records per ID:

ID,Name,DATE 
1,Anil,2000-06-02 
1,Anil,2000-06-03 
1,Anil,2000-06-04 
2,Arun,2000-06-05 
2,Arun,2000-06-06 
2,Arun,2000-06-07 
3,Anju,2000-06-08 
3,Anju,2000-06-09 
3,Anju,2000-06-10 
4,Ram,2000-06-11 
4,Ram,2000-06-02 
4,Ram,2000-06-03 
4,Ram,2000-06-04 
5,Ramu,2000-06-05 
5,Ramu,2000-06-06 
5,Ramu,2000-06-07 
5,Ramu,2000-06-08 
6,Renu,2000-06-09 
7,Gopu,2000-06-10 
7,Gopu,2000-06-11 

But I want only the top two latest records for each ID, i.e. the following output:

ID,Name,DATE 
1,Anil,2000-06-03 
1,Anil,2000-06-04 
2,Arun,2000-06-06 
2,Arun,2000-06-07 
3,Anju,2000-06-09 
3,Anju,2000-06-10 
4,Ram,2000-06-03 
4,Ram,2000-06-04 
5,Ramu,2000-06-07 
5,Ramu,2000-06-08 
6,Renu,2000-06-09 
7,Gopu,2000-06-10 
7,Gopu,2000-06-11 

Do I need to use a window function such as lag?

+0

What DBMS are you using? – Matt

+0

It's an Apache Spark DF. –

Answers

5

Use a LEFT OUTER JOIN with HAVING COUNT(*) < 2.

SELECT d.ID, d.Name, d.Date 
FROM Dataframetable d 
LEFT OUTER JOIN Dataframetable d2 ON d2.ID = d.ID AND d.Date < d2.Date 
GROUP BY d.ID, d.Name, d.Date 
HAVING COUNT(*) < 2 

Output

ID Name Date 
1 Anil 2000-06-03T00:00:00Z 
1 Anil 2000-06-04T00:00:00Z 
2 Arun 2000-06-06T00:00:00Z 
2 Arun 2000-06-07T00:00:00Z 
3 Anju 2000-06-09T00:00:00Z 
3 Anju 2000-06-10T00:00:00Z 
4 Ram  2000-06-04T00:00:00Z 
4 Ram  2000-06-11T00:00:00Z 
5 Ramu 2000-06-07T00:00:00Z 
5 Ramu 2000-06-08T00:00:00Z 
6 Renu 2000-06-09T00:00:00Z 
7 Gopu 2000-06-10T00:00:00Z 
7 Gopu 2000-06-11T00:00:00Z 

SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/1/0
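The self-join condition can be sketched in plain Python (not Spark code): a row survives when fewer than two rows share its ID with a strictly later date, which is exactly what the LEFT OUTER JOIN plus HAVING COUNT(*) < 2 expresses. This is a minimal sketch assuming ISO-formatted date strings, which compare correctly as plain strings.

```python
# Sample data from the question: (ID, Name, DATE).
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"), (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"), (5, "Ramu", "2000-06-07"),
    (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

# Keep a row when fewer than two rows with the same ID carry a strictly
# later date -- the condition the JOIN ... HAVING COUNT(*) < 2 encodes.
top2 = sorted(r for r in rows
              if sum(1 for s in rows if s[0] == r[0] and s[2] > r[2]) < 2)
```

Note that, like the SQL, this keeps ID 4's rows for 2000-06-04 and 2000-06-11 (the two actual latest dates), and the single row for ID 6.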

Using a subquery instead of a self-join.

SELECT ID, name, date FROM (SELECT d.ID, d.Name, MAX(d.Date) Date 
FROM Dataframetable d 
GROUP BY d.ID, d.Name 
UNION ALL 
SELECT d.ID, d.Name, MAX(d.Date) 
FROM Dataframetable d 
WHERE d.Date NOT IN 
(SELECT date FROM (SELECT d.ID, d.Name, MAX(d.Date) Date 
FROM Dataframetable d 
GROUP BY d.ID, d.Name) a) 
GROUP BY d.ID, d.Name) b 
ORDER BY ID 

SQL Fiddle: http://sqlfiddle.com/#!6/8dcc2/19/0
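The intent of the UNION ALL query — the latest date per group plus the next-latest — can be sketched per group in plain Python. This is a hedged approximation rather than a literal transcription: the SQL's NOT IN subquery filters on date alone across all IDs, whereas this sketch removes the maximum within each group before taking the second maximum.

```python
from collections import defaultdict

# Sample data from the question: (ID, Name, DATE).
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"), (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"), (5, "Ramu", "2000-06-07"),
    (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

dates_by_key = defaultdict(list)
for id_, name, date in rows:
    dates_by_key[(id_, name)].append(date)

result = []
for (id_, name), dates in dates_by_key.items():
    latest = max(dates)                          # first branch of the UNION ALL
    result.append((id_, name, latest))
    earlier = [d for d in dates if d < latest]
    if earlier:                                  # second branch: next-latest
        result.append((id_, name, max(earlier)))
result.sort()
```

A group with a single row (ID 6 here) contributes only its latest date, matching the expected output.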

+0

Thanks @Matt. Any idea how we could do this without a self-join? –

+2

@PardeepSharma added an answer that doesn't need a self-join, but I'd imagine it would be slower. – Matt

+0

Could you please share a link explaining why the subquery is slower than the self-join? Or it would be great if you could explain it. –

3

Thanks @Matt - your solution works fine with Apache Spark; I tested it.

val sparkConf = new SparkConf().setAppName("DFTest").setMaster("local[5]") 
val sc = new SparkContext(sparkConf) 
val hadoopConf = sc.hadoopConfiguration 
val sqlContext = new SQLContext(sc) 

val myFile = sc.textFile("C:\\DFTest\\DFTest.txt") 

case class Record(id: Int, name: String, datetime: String) 
val myFile1 = myFile.map(x => x.split(",")).map { 
  case Array(id, name, datetime) => Record(id.toInt, name, datetime) 
} 

import sqlContext.implicits._ 

val myDF = myFile1.toDF() 

myDF.registerTempTable("deep_cust") 

sqlContext.sql("SELECT d.id, d.name, d.datetime FROM deep_cust d " + 
  "LEFT OUTER JOIN deep_cust d2 ON d2.id = d.id AND d.datetime < d2.datetime " + 
  "GROUP BY d.id, d.name, d.datetime " + 
  "HAVING COUNT(*) < 2").show() 

But it won't work directly with Hive, because Hive doesn't support non-equi joins; we'd have to use some other alternative such as RANK.

Alternative approach:

@Matt, could you please advise whether the RANK solution below is faster than the join? If not, then we'd have to use a WHERE clause instead of AND d.Date < d2.Date.

select x.id, x.name, x.datetime 
from (select id, name, datetime, 
             rank() over (partition by id, name order by datetime desc) as rownum 
      from deep_cust) x 
where x.rownum < 3;
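The window logic above can be sketched in plain Python: sort each (id, name) partition by date descending, number the rows, and keep positions 1 and 2. A minimal sketch, assuming dates are distinct within each ID — with duplicate dates this numbering behaves like row_number() rather than rank().

```python
from collections import defaultdict

# Sample data from the question: (ID, Name, DATE).
rows = [
    (1, "Anil", "2000-06-02"), (1, "Anil", "2000-06-03"), (1, "Anil", "2000-06-04"),
    (2, "Arun", "2000-06-05"), (2, "Arun", "2000-06-06"), (2, "Arun", "2000-06-07"),
    (3, "Anju", "2000-06-08"), (3, "Anju", "2000-06-09"), (3, "Anju", "2000-06-10"),
    (4, "Ram", "2000-06-11"), (4, "Ram", "2000-06-02"), (4, "Ram", "2000-06-03"),
    (4, "Ram", "2000-06-04"),
    (5, "Ramu", "2000-06-05"), (5, "Ramu", "2000-06-06"), (5, "Ramu", "2000-06-07"),
    (5, "Ramu", "2000-06-08"),
    (6, "Renu", "2000-06-09"),
    (7, "Gopu", "2000-06-10"), (7, "Gopu", "2000-06-11"),
]

partitions = defaultdict(list)
for id_, name, date in rows:
    partitions[(id_, name)].append(date)

# rank() over (partition by id, name order by datetime desc) ... where rownum < 3
top2 = []
for (id_, name), dates in partitions.items():
    for rank, date in enumerate(sorted(dates, reverse=True), start=1):
        if rank < 3:
            top2.append((id_, name, date))
top2.sort()
```

Because the ranking is computed within each partition, a single pass over the data suffices — unlike the self-join, which compares every pair of rows sharing an ID.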