2016-04-27
1

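Lookup from the same table with dplyr

I have data on how team members from multiple teams rate one another. Each person has their own ID number (StudyID), but also a rater number within the team, so the data looks like this: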

  StudyID TeamID CATMERater Rated   Rating
    (int)  (int)      (int) (dbl)    (dbl)
1    2930    551          1     1 5.000000  # how rater 1 rated 1 (themselves)
2    2938    551          2     1 3.800000  # how rater 2 rated 1
3    2939    551          3     1 5.000000  # how rater 3 rated 1
4    2930    551          1     2 3.666667  # how rater 1 rated 2
5    2938    551          2     2 4.000000  # ...
6    2939    551          3     2 3.866667
...

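And so on. I got the data into this format with tidyr, and I am trying to add a new column holding the StudyID of the person being rated, i.e. the StudyID from the row with the same TeamID whose CATMERater equals Rated. This is what I tried, but it did not work because I don't know how to refer to the same table: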

edges %>% mutate(RatedStudyID = filter(edges, TeamID == TeamID & Rated == CATMERater)) 

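Hopefully this makes sense; I would appreciate any suggestions that point me in the right direction. If the answer involves left_join, how do I say TeamID == TeamID?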

This is what I would like to end up with (mainly the last column), added per @akron:

  StudyID TeamID CATMERater Rated   Rating RatedStudyID
    (int)  (int)      (int) (dbl)    (dbl)
1    2930    551          1     1 5.000000         2930
2    2938    551          2     1 3.800000         2930
3    2939    551          3     1 5.000000         2930
4    2930    551          1     2 3.666667         2938
5    2938    551          2     2 4.000000         2938
6    2939    551          3     2 3.866667         2938
...

dput output of the data that gives an error:

structure(list(StudyID = c(2930L, 2938L, 2939L, 2930L, 2938L, 
2939L, 2930L, 2938L, 2939L, 2930L, 2938L, 2939L, 2930L, 2938L, 
2939L, 2930L, 2938L, 2939L, 2920L, 2941L, 2989L, 2920L, 2941L, 
2989L, 2920L, 2941L, 2989L, 2920L, 2941L, 2989L, 2920L, 2941L, 
2989L, 2920L, 2941L, 2989L, 2922L, 2924L, 2943L, 2922L, 2924L, 
2943L, 2922L, 2924L, 2943L, 2922L, 2924L, 2943L, 2922L, 2924L 
), TeamID = c(551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 
551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 551L, 552L, 
552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 552L, 
552L, 552L, 552L, 552L, 552L, 552L, 553L, 553L, 553L, 553L, 553L, 
553L, 553L, 553L, 553L, 553L, 553L, 553L, 553L, 553L), CATMERater = c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L, 
2L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L), Rated = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 
6, 6, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 1, 
1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5), Rating = c(5, 3.8, 5, 
3.66666666666667, 4, 3.86666666666667, 4.53333333333333, 4, 4.8, 
NaN, NaN, NaN, NaN, NaN, NaN, NA, NA, NA, 3.93333333333333, 5, 
5, 5, 5, 5, 5, 5, 5, NaN, NaN, NaN, NaN, NaN, NaN, NA, NA, NA, 
4, 4, 4, 4, 4, 4, 4, 3.86666666666667, 4, NaN, NaN, NaN, NaN, 
NaN)), .Names = c("StudyID", "TeamID", "CATMERater", "Rated", 
"Rating"), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-50L)) 
+2

See [how to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for better ways to share sample data and make it easier to help you. – MrFlick

+0

Can you dput the data frame? – user2600629

+1

`edges %>% group_by(Rated, TeamID) %>% mutate(new = StudyID[CATMERater == Rated])`? – jeremycg

Answers

1

From the comments:

library(dplyr) 
x %>% 
    group_by(Rated, TeamID) %>% #group by each team/rated individual 
    filter(any(CATMERater == Rated)) %>% #filter out any groups with unrated individuals 
    mutate(new = StudyID[CATMERater == Rated]) #make the new column 

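The new column is created by subsetting within each group: it is the same as what x$StudyID[x$CATMERater == x$Rated] would do on the whole data frame. As long as the group contains one row where the condition is true (i.e. a self-rating), that row's StudyID is assigned to every member of the group. Groups without a self-rating are dropped by the filter step.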
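
If the groups without a self-rating should be kept rather than filtered out, one possible variation (not part of the original comment) is to index the subset with `[1]`: in base R, taking the first element of an empty vector yields `NA`, so unmatched groups get `NA` instead of raising an error. The sketch below assumes dplyr is installed; the small `edges` data frame and the `result` name are made up for illustration, and only the column names come from the question.

```r
library(dplyr)

# Hypothetical toy data: team 551 has raters 1-3, and "Rated == 4"
# has no matching rater, so that group has no self-rating row.
edges <- data.frame(
  StudyID    = rep(c(2930L, 2938L, 2939L), times = 3),
  TeamID     = 551L,
  CATMERater = rep(1:3, times = 3),
  Rated      = rep(c(1, 2, 4), each = 3)
)

result <- edges %>%
  group_by(TeamID, Rated) %>%
  # StudyID[CATMERater == Rated] is empty when nobody rated themselves;
  # [1] turns that empty vector into a single NA, which mutate recycles.
  mutate(RatedStudyID = StudyID[CATMERater == Rated][1]) %>%
  ungroup()

print(result)
```

Unlike the filter-based version, this keeps every row of the input, which may or may not be what is wanted.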

0

With data.table:

library(data.table) 
setDT(edges)[ , RatedStudyID := StudyID[CATMERater == Rated] , .(Rated, TeamID)] 
edges 
# StudyID TeamID CATMERater Rated Rating RatedStudyID 
#1: 2930 551   1  1 5.000000   2930 
#2: 2938 551   2  1 3.800000   2930 
#3: 2939 551   3  1 5.000000   2930 
#4: 2930 551   1  2 3.666667   2938 
#5: 2938 551   2  2 4.000000   2938 
#6: 2939 551   3  2 3.866667   2938 

In the new dataset, some groups have no row in which CATMERater equals Rated. So we can use an if/else condition to return NA for those groups:

setDT(df1)[, RatedStudyID := if (!any(CATMERater == Rated)) NA_integer_
             else StudyID[CATMERater == Rated], .(Rated, TeamID)]
0

I think you can solve this with a join:

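edges %>% 
    # build a lookup: each rater's number within the team and their StudyID
    select(TeamID, Rated = CATMERater, RatedStudyID = StudyID) %>% 
    distinct() %>% # one lookup row per team member, otherwise the join duplicates rows
    inner_join(edges, by = c("TeamID", "Rated"))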
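
On the question of how to say "TeamID == TeamID" in a join: passing `by = c("TeamID", "Rated")` matches rows on both columns at once. A hedged sketch of a `left_join` variant follows, which keeps rows whose rated person has no rater row in the team and fills them with `NA`. The toy `edges` data and the `lookup` and `result` names are illustrative assumptions, not from the original answer.

```r
library(dplyr)

# Hypothetical toy data: "Rated == 4" refers to someone with no rater row.
edges <- data.frame(
  StudyID    = rep(c(2930L, 2938L, 2939L), times = 3),
  TeamID     = 551L,
  CATMERater = rep(1:3, times = 3),
  Rated      = rep(c(1, 2, 4), each = 3)
)

# One row per team member: their within-team number and their StudyID.
# CATMERater is converted to numeric so the key type matches Rated.
lookup <- edges %>%
  distinct(TeamID, CATMERater, StudyID) %>%
  transmute(TeamID, Rated = as.numeric(CATMERater), RatedStudyID = StudyID)

# Match on team and rated number; unmatched rows get NA in RatedStudyID.
result <- edges %>%
  left_join(lookup, by = c("TeamID", "Rated"))

print(result)
```

Deduplicating the lookup first matters here: each rater appears once per rated person, so joining without `distinct()` would multiply the rows.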