2013-11-21 47 views
1

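Google BigQuery: How to get distinct rows for a value in query results

I'm trying to use Google BigQuery on the GitHub Archive (http://www.githubarchive.org/) data to get repository statistics as of each repository's latest event, and I want this for the repositories with the most watchers. I realize this is a lot, but I feel like I'm really close to getting it in one query.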

This is the query I have now:

SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time 
FROM [githubarchive:github.timeline] 
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language 
ORDER BY watchers DESC, time DESC 
LIMIT 1000 

The only problem is that I get every event from the most-watched repository (Twitter's Bootstrap):

Result:

Row repository_name repository_owner repository_organization repository_size watchers forks repository_language time  
1 bootstrap   twbs     twbs     83875  61191  21602 JavaScript   1384991582000000  
2 bootstrap   twbs     twbs     83875  61190  21602 JavaScript   1384991337000000  
3 bootstrap   twbs     twbs     83875  61190  21603 JavaScript   1384989683000000 

...

How can I get this to return a single result (the most recent one, i.e. MAX(time)) per repository_name?

I have tried:

SELECT repository_name, repository_owner, repository_organization, repository_size, repository_watchers as watchers, repository_forks as forks, repository_language, MAX(PARSE_UTC_USEC(created_at)) as time 
FROM [githubarchive:github.timeline] 
WHERE PARSE_UTC_USEC(created_at) IN (SELECT MAX(PARSE_UTC_USEC(created_at)) FROM [githubarchive:github.timeline]) 
GROUP EACH BY repository_name, repository_owner, repository_organization, repository_size, watchers, forks, repository_language 
ORDER BY watchers DESC, time DESC 
LIMIT 1000 

I'm not sure whether this would work anyway, but it doesn't matter, because I get this error message:

Error: Join attribute is not defined: PARSE_UTC_USEC 

Any help would be great, thanks.

Answer

4

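One problem with that query is that if two events happen at the same time, your results could get confused. You can get what you want by grouping by repository name to find the latest event time per repository, and then joining back against the table to pick up the other fields. For example: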

SELECT 
    a.repository_name as name, 
    a.repository_owner as owner, 
    a.repository_organization as organization, 
    a.repository_size as size, 
    a.repository_watchers AS watchers, 
    a.repository_forks AS forks, 
    a.repository_language as language, 
    PARSE_UTC_USEC(a.created_at) AS time 
FROM [githubarchive:github.timeline] a 
JOIN EACH 
    (
    SELECT MAX(created_at) as max_created, repository_name 
    FROM [githubarchive:github.timeline] 
    GROUP EACH BY repository_name 
) b 
    ON 
    b.max_created = a.created_at and 
    b.repository_name = a.repository_name 
ORDER BY watchers desc 
LIMIT 1000 
+0

Yes, thank you very much. This is exactly what I needed. – brycek

+1

Just one comment on this solution: it causes problems when repositories share the same name, so I added: 'b.repository_owner = a.repository_owner'. – brycek
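
To make the point about shared repository names concrete, here is a minimal sketch of the same group-then-join-back pattern. It runs against a toy in-memory SQLite table rather than BigQuery, so the table `timeline`, its shortened column names, the owner `someone`, and the helper `latest_rows` are all assumptions made for illustration, and the rows are invented. Note that for the extra join condition to work, the subquery also has to select and group by the owner.

```python
import sqlite3

# Toy stand-in for the GitHub timeline table. Column names are shortened
# and the rows are made up; this is SQLite, not BigQuery legacy SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeline (name TEXT, owner TEXT, watchers INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO timeline VALUES (?, ?, ?, ?)",
    [
        ("bootstrap", "twbs", 61190, "2013-11-20 23:48:57"),
        ("bootstrap", "twbs", 61191, "2013-11-20 23:53:02"),
        ("bootstrap", "someone", 3, "2013-11-19 10:00:00"),
        ("bootstrap", "someone", 4, "2013-11-21 08:00:00"),
    ],
)


def latest_rows(keys):
    """Return the latest event per group; `keys` are the grouping columns.

    `keys` is interpolated into the SQL text, so it must only ever contain
    trusted, hard-coded column names.
    """
    cols = ", ".join(keys)
    on = " AND ".join(f"b.{k} = a.{k}" for k in keys)
    sql = f"""
        SELECT a.name, a.owner, a.watchers, a.created_at
        FROM timeline a
        JOIN (
            SELECT {cols}, MAX(created_at) AS max_created
            FROM timeline
            GROUP BY {cols}
        ) b
          ON b.max_created = a.created_at AND {on}
        ORDER BY a.watchers DESC
    """
    return conn.execute(sql).fetchall()


# Keyed on name only: the two "bootstrap" repositories collapse into one group.
by_name = latest_rows(["name"])
# Keyed on name and owner: each repository keeps its own latest row.
by_name_and_owner = latest_rows(["name", "owner"])

print(by_name)
print(by_name_and_owner)
```

With the name-only key, only the row belonging to whichever same-named repository had the most recent event survives the join, so the other repository disappears from the result. Keying on both name and owner returns one latest row per repository. If two events for the same repository share the maximum timestamp, this pattern still returns both rows.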
