2017-04-12 89 views
0

我有一系列來自一系列曲棍球比賽的得分數據,並且我處於分析階段。我試圖在每場比賽中每10分鐘畫出主隊的領先優勢。根據得分數據定期計算球隊領先優勢

這裏是哪裏我到目前爲止已經得到我的數據集的例子:

library(tidyverse) 

# Generate example data ordered by gameid and event_ts 
game <- tibble(event_type = "goal", event_ts = runif(n = 1000, min = 0, max = 60), 
     team = sample(c("home", "away"), size = 1000, replace = TRUE, prob = c(0.55,0.45)), 
     gameid = sample(100:300, size = 1000, replace = TRUE)) %>% 
    arrange(gameid, event_ts) 

我知道,我可以用summarise每場比賽的最終比分。下面是一個假設兩隊得分至少一個目標在每場比賽一個簡單的例子:

game %>% 
    group_by(gameid, team) %>% 
    summarise(goals = n()) %>% 
    spread(key = team, value = goals) %>% 
    mutate(away = ifelse(is.null(away), 0, away)) 

我想在整個遊戲10分鐘間隔計算出主隊的領先優勢(正或負)。這需要總結那時發生的所有得分。這裏有一個我想要得到的結構的例子:

finished_demo <- tibble(
    gameid = sort(rep_len(seq(100, 300, 1), 1206)), 
    timestamp = rep(seq(10, 60, 10), 201), 
    home_lead = round(runif(
    n = 1206, min = -5, max = 7 
)) 
) %>% arrange(gameid, timestamp) 
+1

'庫(tidyverse);遊戲%>%mutate(event_ts = ceiling(event_ts/10)* 10)%>%complete(event_ts,gameid,team)%>%group_by(gameid,team,event_ts)%>%summarize(score = coalesce(sum %>總結(ts = list(event_ts),score = list(cumsum(得分)))%>%unnest()%>%spread(團隊,分數)%> %mutate(home_lead = home - away)' – alistaire

回答

4

下面是使用data.table完成它的一種方式,IIUC:

require(data.table) 
setDT(game) # generated with op's code but with a seed(1L) 

key <- CJ(gameid=unique(game$gameid), start=1L, end=(1:6)*10L) 
ans <- game[key, on=.(gameid, event_ts >= start, event_ts <= end), # (1) 
       .(home_lead=sum(team == "home")-sum(team == "away")), # (2) 
       by=.EACHI]           # (3) 

head(ans) 
# gameid event_ts event_ts home_lead 
# 1: 100  1  10  NA 
# 2: 100  1  20   1 
# 3: 100  1  30   0 
# 4: 100  1  40   0 
# 5: 100  1  50  -1 
# 6: 100  1  60  -2 

可以重命名重複的列名(我會當我得到的時間進行這項工作解決這個問題)。


(1)搜索在game行索引爲每一行中key而上下on參數所提供的條件匹配的匹配。 (2)計算home組的領先優勢。

(3).EACHI通知主隊領先應該爲key每一行匹配的game行來計算。

NA意味着沒有匹配的事件。如果有必要,他們可以通過做代替,以0

ans[is.na(home_lead), home_lead := 0L] 
1

這個怎麼樣?

game %>% 
mutate(ten_min = event_ts %/% 10, 
     homegoal = if_else(team == 'home', 1, -1)) %>% 
group_by(ten_min, gameid) %>% 
summarize(home_lead_interval = sum(homegoal)) %>% 
ungroup() %>% 
group_by(gameid) %>% 
mutate(home_lead = cumsum(home_lead_interval)) %>% 
arrange(gameid, ten_min) 
# Source: local data frame [683 x 4] 
# Groups: gameid [198] 
# 
# ten_min gameid home_lead_interval home_lead 
#  <dbl> <int>    <dbl>  <dbl> 
# 1  0 100     0   0 
# 2  1 100     -1  -1 
# 3  2 100     -3  -4 
# 4  3 100     -1  -5 
# 5  4 100     2  -3 
# 6  5 100     -1  -4 
# 7  1 101     1   1 
# 8  2 101     1   2 
# 9  4 101     -2   0 
# 10  0 102     1   1 
# # ... with 673 more rows 
1

我很喜歡99%,肯定會有人可以與一些嵌入在purrr發現/嵌套(?)結構重寫這個。不同的nrow()從上面的結果(使用相同的數據),所以不能保證解決方案是正確的。

game %>% 
     group_by(gameid) %>% 
     do(data.frame(time = 10 * (1:(max(.$event_ts) %/% 10)))) %>% 
     apply(1, function(x) { 
          g = x[1] %>% unlist 
          t = x[2] %>% unlist 
          game %>% 
           filter(gameid == g, event_ts < t) %>% 
           group_by(gameid, team) %>% 
           summarise(goals = n()) %>% 
           mutate(time = t) 
          }) %>% 
     bind_rows %>% 
     spread(key = team, value = goals) %>% 
     mutate_all(as.numeric) %>% 
     mutate(away = ifelse(is.na(away), 0, away), 
      home = ifelse(is.na(home), 0, home)) 


    gameid time away home 
    <int> <dbl> <dbl> <dbl> 
1  100 10  0  1 
2  100 20  1  3 
3  100 30  1  3 
4  101 20  0  1 
5  101 30  1  1 
6  101 40  1  2 
7  101 50  1  2 
1

我的想法是每10分鐘獲得主客場比分。然後你可以將基於gameid的data.frame分組,並創建你想要的結果。

set.seed(123) 
# Generate example data ordered by gameid and event_ts 
game <- tibble(event_type = "goal", event_ts = runif(n = 1000, min = 0, max = 60), 
       team = sample(c("home", "away"), size = 1000, replace = TRUE, prob = c(0.55,0.45)), 
       gameid = sample(100:300, size = 1000, replace = TRUE)) %>% 
    arrange(gameid, event_ts) 

# Change the event_ts and get all 10 minutes intervals 
hl <- game %>% 
    mutate(event_ts=ceiling(event_ts/10) * 10) %>% 
    dcast(gameid + event_ts ~ team, length) %>% 
    right_join(expand.grid(gameid=unique(game$gameid), event_ts=seq(10, 60, 10))) 
hl$away[is.na(hl$away)] <- 0 
hl$home[is.na(hl$home)] <- 0 
# Get the home lead 
hl <- hl %>% 
    arrange(gameid, event_ts) %>% 
    group_by(gameid) %>% 
    mutate(away=cumsum(away), 
     home=cumsum(home), 
     home_lead=home - away) 

# Check the game 100 and 101 
game %>% filter(gameid %in% 100:101) 
# A tibble: 7 × 4 
    event_type event_ts team gameid 
     <chr>  <dbl> <chr> <int> 
1  goal 30.460972 home 100 
2  goal 57.270219 home 100 
3  goal 1.126093 home 101 
4  goal 27.879957 home 101 
5  goal 33.086101 home 101 
6  goal 42.497419 away 101 
7  goal 45.649418 home 101 

hl %>% filter(gameid %in% 100:101) 
Source: local data frame [12 x 5] 
Groups: gameid [2] 

    gameid event_ts away home home_lead 
    <int> <dbl> <dbl> <dbl>  <dbl> 
1  100  10  0  0   0 
2  100  20  0  0   0 
3  100  30  0  0   0 
4  100  40  0  1   1 
5  100  50  0  1   1 
6  100  60  0  2   2 
7  101  10  0  1   1 
8  101  20  0  1   1 
9  101  30  0  2   2 
10 101  40  0  3   3 
11 101  50  1  4   3 
12 101  60  1  4   3