2017-07-27

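Create predictions for subsets of a data frame and append them to the original data

I am using R 3.3.2.

I want to predict institutions' scores in different subrankings based on their scores in previous years. I then need to add these predicted scores as new rows to the original data frame. My input is a CSV file.

I want to use a least-squares linear model, and found that `lm` and `predict` appear to do what I need.

I know this is a fairly basic question, but I hope someone can help me. Please see the data and code below; I have started on two solutions.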

score<-c(63.6, 60.3, 60.4, 53.4, 46.5, 65.8, 45.8, 65.9, 
44.9, 60, 83.5, 81.7, 81.2, 78.8, 83.3, 79.4, 83.2, 77.3, 
79.4) 

year<-c(2013, 2014, 2015, 2016, 2014, 2014, 2015, 2015, 
2016, 2016, 2011, 2012, 2013, 2014, 2014, 2015, 2015, 
2016, 2016) 

institution<-c(1422, 1422, 1422, 1422, 1384, 1422, 1384, 
1422, 1384, 1422, 1384, 1384, 1384, 1422, 1384, 1422, 
1384, 1422, 1384) 

subranking<-c('CMP', 'CMP', 'CMP', 'CMP', 'SSC', 'SSC', 'SSC', 
'SSC', 'SSC', 'SSC', 'ETC', 'ETC', 'ETC', 'ETC', 'ETC', 'ETC', 
'ETC', 'ETC', 'ETC') 

d <- data.frame(score, year, institution,subranking) 


#-----------SOLUTION 1 ------------------- 

p<- unique(d$institution) 
for (i in (1:length(p))){ 
    x<- d$score[d$institution==p[i]] 
    y<- d$year[d$institution==p[i]] 
    model<- lm(x~y) 
    result<-predict(model, data.frame(y = c(2017,2018,2019,2020))) 
    z<- cbind(result,data.frame(y = c(2017,2018,2019,2020))) 
    print(z) 
} 

##----------SOLUTION 2 ------------------- 

calculate_predicted_scores <- function(scores, years) {
    mod <- lm(scores ~ years)
    predicted_scores <- predict(mod, data.frame(years = c(2017, 2018, 2019, 2020)))
    return(predicted_scores)
}

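To illustrate, this is what I would like to end up with; the yellow rows are the predictions:

[Image: the desired data frame, with the predicted rows highlighted in yellow]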

Answers

You can try dplyr together with broom, as described in this very helpful answer.

library(dplyr) 
library(broom) 
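# Fit one linear model per (subranking, institution) group and predict 2017-2020
pred_per_group = d %>% group_by(subranking, institution) %>% 
    do(predicted_scores=predict(lm(score ~ year, data=.), data.frame(year = c(2017,2018,2019, 2020)))) 
# Unpack the list column into one row per prediction (the values land in column x)
pred_df = tidy(pred_per_group, predicted_scores) 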

Then use rbind to append the data frame of predictions to your original data.

# rep(..., 5): the four target years are repeated once for each of the 5 groups in d
pred_df <- data.frame(score=pred_df$x, year=rep(c(2017,2018,2019,2020), 5), institution=pred_df$institution, subranking=pred_df$subranking) 
result <- rbind(d, pred_df) 
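
The `tidy()` call on a grouped `do()` result reflects broom as it was in 2017 and may not be available in newer broom releases. As a hedged alternative, here is a minimal sketch of the same idea in base R only: split the data by subranking and institution, fit one line per piece, and bind the predictions back onto the original rows. The names `future_years`, `groups` and `pred_list` are illustrative and not part of the original answer.

```r
# Example data from the question
d <- data.frame(
  score = c(63.6, 60.3, 60.4, 53.4, 46.5, 65.8, 45.8, 65.9, 44.9, 60,
            83.5, 81.7, 81.2, 78.8, 83.3, 79.4, 83.2, 77.3, 79.4),
  year = c(2013, 2014, 2015, 2016, 2014, 2014, 2015, 2015, 2016, 2016,
           2011, 2012, 2013, 2014, 2014, 2015, 2015, 2016, 2016),
  institution = c(1422, 1422, 1422, 1422, 1384, 1422, 1384, 1422, 1384, 1422,
                  1384, 1384, 1384, 1422, 1384, 1422, 1384, 1422, 1384),
  subranking = c(rep("CMP", 4), rep("SSC", 6), rep("ETC", 9)),
  stringsAsFactors = FALSE
)

future_years <- c(2017, 2018, 2019, 2020)  # assumed prediction horizon

# One piece per (subranking, institution) combination; drop = TRUE discards
# combinations without any rows (CMP for institution 1384).
groups <- split(d, list(d$subranking, d$institution), drop = TRUE)

# Fit a least-squares line per piece and predict the future years.
pred_list <- lapply(groups, function(g) {
  fit <- lm(score ~ year, data = g)
  data.frame(
    score       = unname(predict(fit, data.frame(year = future_years))),
    year        = future_years,
    institution = g$institution[1],
    subranking  = g$subranking[1],
    stringsAsFactors = FALSE
  )
})

pred_df <- do.call(rbind, pred_list)
rownames(pred_df) <- NULL

# Append the predicted rows to the original data frame.
result <- rbind(d, pred_df)
```

Because the number of groups is taken from the data, this version does not need the hard-coded `5`.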

Edit on 3 August: since you want to work this out with your own code, I would go about it as follows, building on your own attempt:

p <- unique(d$institution)
r <- unique(d$subranking)
for (i in seq_along(p)) {
    for (j in seq_along(r)) {
        score <- d$score[d$institution == p[i] & d$subranking == r[j]]
        year <- d$year[d$institution == p[i] & d$subranking == r[j]]
        if (length(score) == 0) {
            print(sprintf("No level for the following combination: Institution: %s and Subrank: %s", p[i], r[j]))
        } else {
            model <- lm(score ~ year)
            result <- predict(model, data.frame(year = c(2017, 2018, 2019, 2020)))
            z <- cbind(result, data.frame(year = c(2017, 2018, 2019, 2020)))
            print(sprintf("For Institution: %s and Subrank: %s the Score is:", p[i], r[j]))
            print(z)
        }
    }
}

[1] "For Institution: 1422 and Subrank: CMP the Score is:" 
    result year 
1 51.80 2017 
2 48.75 2018 
3 45.70 2019 
4 42.65 2020 
[1] "For Institution: 1422 and Subrank: SSC the Score is:" 
    result year 
1 58.1 2017 
2 55.2 2018 
3 52.3 2019 
4 49.4 2020 
[1] "For Institution: 1422 and Subrank: ETC the Score is:" 
    result year 
1 77.00 2017 
2 76.25 2018 
3 75.50 2019 
4 74.75 2020 
[1] "No level for the following combination: Institution: 1384 and Subrank: CMP" 
[1] "For Institution: 1384 and Subrank: SSC the Score is:" 
    result year 
1 44.13333 2017 
2 43.33333 2018 
3 42.53333 2019 
4 41.73333 2020 
[1] "For Institution: 1384 and Subrank: ETC the Score is:" 
    result year 
1 80.66000 2017 
2 80.26286 2018 
3 79.86571 2019 
4 79.46857 2020 
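
The loop above only prints the predictions. To append them to the original data frame as the question asks, the same double loop can collect one small data frame per combination and bind everything at the end. The following is a sketch under assumptions, not the answerer's code: it reuses the question's `calculate_predicted_scores` with an added, hypothetical `new_years` argument, and `pieces` and `future_years` are illustrative names.

```r
# Example data from the question
d <- data.frame(
  score = c(63.6, 60.3, 60.4, 53.4, 46.5, 65.8, 45.8, 65.9, 44.9, 60,
            83.5, 81.7, 81.2, 78.8, 83.3, 79.4, 83.2, 77.3, 79.4),
  year = c(2013, 2014, 2015, 2016, 2014, 2014, 2015, 2015, 2016, 2016,
           2011, 2012, 2013, 2014, 2014, 2015, 2015, 2016, 2016),
  institution = c(1422, 1422, 1422, 1422, 1384, 1422, 1384, 1422, 1384, 1422,
                  1384, 1384, 1384, 1422, 1384, 1422, 1384, 1422, 1384),
  subranking = c(rep("CMP", 4), rep("SSC", 6), rep("ETC", 9)),
  stringsAsFactors = FALSE
)

future_years <- c(2017, 2018, 2019, 2020)  # assumed prediction horizon

# Solution 2 from the question, with the target years passed in
# (the new_years argument is an addition for this sketch).
calculate_predicted_scores <- function(scores, years, new_years) {
  mod <- lm(scores ~ years)
  unname(predict(mod, data.frame(years = new_years)))
}

pieces <- list()
for (inst in unique(d$institution)) {
  for (subr in unique(d$subranking)) {
    rows <- d$institution == inst & d$subranking == subr
    if (!any(rows)) next  # no data for this combination, nothing to fit
    pieces[[length(pieces) + 1]] <- data.frame(
      score       = calculate_predicted_scores(d$score[rows], d$year[rows], future_years),
      year        = future_years,
      institution = inst,
      subranking  = subr,
      stringsAsFactors = FALSE
    )
  }
}

# Original rows first, predicted rows after them.
result <- rbind(d, do.call(rbind, pieces))
```

Rows 20 to 39 of `result` then hold the predictions, in the same institution and subranking order as the printed output.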
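@Ago https://stackoverflow.com/help/someone-answers: Accepting an answer is important, because it both rewards the poster for solving your problem and lets others know that your issue is resolved. –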

I wonder if you could help me: how can I get the same result using the two solutions I started in my original script? – Ago

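You said "I want to predict institutions' scores in different subrankings based on their scores in previous years". However, in both solutions (1 and 2) you leave out the subranking variable. A double loop or an adjusted function can do what dplyr and broom do in my answer, but is that what you really want? I mean methodologically, not programmatically. Do you want to include the institution and subranking variables as factors in the regression equation? If your answer is "yes" or "I am not sure", I suggest you ask on Cross Validated, because methodological questions are beyond the scope of Stack Overflow. –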
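
To make the methodological alternative in this comment concrete, here is one possible reading of it, sketched under assumptions: a single pooled model in which institution and subranking enter as factors, so every group shares one year slope and differs only by an offset. Whether that is statistically appropriate is exactly the open question raised above; the names `fit`, `combos` and `new_data` are illustrative.

```r
# Example data from the question
d <- data.frame(
  score = c(63.6, 60.3, 60.4, 53.4, 46.5, 65.8, 45.8, 65.9, 44.9, 60,
            83.5, 81.7, 81.2, 78.8, 83.3, 79.4, 83.2, 77.3, 79.4),
  year = c(2013, 2014, 2015, 2016, 2014, 2014, 2015, 2015, 2016, 2016,
           2011, 2012, 2013, 2014, 2014, 2015, 2015, 2016, 2016),
  institution = c(1422, 1422, 1422, 1422, 1384, 1422, 1384, 1422, 1384, 1422,
                  1384, 1384, 1384, 1422, 1384, 1422, 1384, 1422, 1384),
  subranking = c(rep("CMP", 4), rep("SSC", 6), rep("ETC", 9)),
  stringsAsFactors = FALSE
)

# One pooled least-squares model: common year slope, with institution and
# subranking as categorical offsets (an assumed specification).
fit <- lm(score ~ year + factor(institution) + subranking, data = d)

# Predict only for combinations that occur in the data.
combos <- unique(d[, c("institution", "subranking")])
new_data <- merge(combos, data.frame(year = 2017:2020), by = NULL)  # cross join
new_data$score <- unname(predict(fit, new_data))

# Same column order as d, then append.
result <- rbind(d, new_data[, names(d)])
```

Unlike the per-group fits, this model produces different predictions, because the groups no longer get their own slopes.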
