2017-09-05 153 views
4

Complex data.table operation

Suppose I have a data table recording which people watched which movies, like this:

library(data.table)
DT = fread("User,Movie
Alice,Fight Club
Alice,The Godfather
Bob,Titanic
Charlotte,The Godfather")

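For every pair of movies, I want to count the number of people who watched both and the number of people who watched at least one of them, i.e.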

Movie1         Movie2         WatchedOne  WatchedBoth
Fight Club     The Godfather  2           1
The Godfather  Titanic        3           0
Fight Club     Titanic        2           0

I have several million rows, so I need an extremely fast data.table solution :-)

Thanks for your help!

+4

Try to make an easily reproducible example (e.g., one that can be copy-pasted). See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 – Frank

+0

OK, done, thanks – mac

+0

How many distinct movies does your dataset contain? – Uwe

Answers

1

This achieves what you are after:

library(data.table)

mydt <- data.table(User = c("Alice", "Alice", "Bob", "Charlotte"),
                   Movie = c("Fight Club", "The Godfather", "Titanic", "The Godfather"))
## all unordered pairs of distinct movies, one pair per row
mydt2 <- data.table(t(mydt[, combn(unique(Movie), 2, simplify = FALSE)]))
setnames(mydt2, c("Movie1", "Movie2"))
## for each pair, count how many of the two movies each user watched
temp <- apply(mydt2, 1, function(x) mydt[Movie %in% x, .N, by = User])
## sapply (rather than lapply) gives plain integer columns instead of list columns
mydt2[, WatchedOne := sapply(temp, function(x) x[, length(N)])]
mydt2[, WatchedBoth := sapply(temp, function(x) x[, sum(N == 2)])]

#           Movie1        Movie2 WatchedOne WatchedBoth
# 1:    Fight Club The Godfather          2           1
# 2:    Fight Club       Titanic          2           0
# 3: The Godfather       Titanic          3           0
+0

Is it possible to parallelize the code to take advantage of multiple processors? Thanks – mac

2

Another way:

# collapse to one row per movie, holding the list of users who watched it
DT = DT[, .(Users = list(User)), keyby='Movie']

# all unordered pairs of movies
Y = data.table(t(combn(DT$Movie, 2)))
setnames(Y, c('Movie1','Movie2'))

# attach each movie's user list to the pair table
Y[DT, on=.(Movie1==Movie), Movie1.Users:= Users]
Y[DT, on=.(Movie2==Movie), Movie2.Users:= Users]

#Y[, WatchedOne:= lengths(Map(union, Movie1.Users, Movie2.Users))]
Y[, WatchedBoth:= lengths(Map(intersect, Movie1.Users, Movie2.Users))]
# better:
Y[, WatchedOne:= lengths(Movie1.Users) + lengths(Movie2.Users) - WatchedBoth]

Y[, -(3:4)]
#           Movie1        Movie2 WatchedBoth WatchedOne
# 1:    Fight Club The Godfather           1          2
# 2:    Fight Club       Titanic           0          2
# 3: The Godfather       Titanic           0          3
+0

Is it possible to parallelize the code to take advantage of multiple processors? Thanks – mac

+0

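Yes, it's possible - I think the bottleneck will be computing `union` and `intersect`. You might first try computing the union size as `lengths(Movie1.Users) + lengths(Movie2.Users) - WatchedBoth` and see where that gets you. It is more efficient than what I originally wrote – sirallen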
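
The suggestion in the comment above rests on inclusion-exclusion: the size of a union equals the sum of the two set sizes minus the size of the intersection, so `union` never needs to be called. A minimal sketch, using made-up user vectors `a` and `b` (illustrative only, not from the question's data) and assuming neither vector contains duplicates:

```r
# Hypothetical viewer lists for two movies (illustrative data only)
a <- c("Alice", "Charlotte")
b <- c("Alice", "Bob")

# users who watched both movies
watched_both <- length(intersect(a, b))

# users who watched at least one, via inclusion-exclusion
watched_one <- length(a) + length(b) - watched_both

# agrees with the direct (slower) computation
stopifnot(watched_one == length(union(a, b)))
```

If a user could appear twice for the same movie, the lists would need to be de-duplicated first (for example with `unique`), otherwise the lengths overcount.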

0

@sirallen @simone Thank you for your answers; I tried both. However, the fastest way I found is the following:

library(data.table)
library(parallel)

# DT here has columns `movie` and `user_ID`.
# Assumption: `movie` was meant to be the vector of distinct movie titles.
DT_comb <- as.data.table(t(combn(unique(DT$movie), 2)))
setnames(DT_comb, c("movie1", "movie2"))

# number of users who watched both movies
function_1 <- function(movie_i, movie_j){
    ur_i = DT[movie == movie_i, user_ID]
    ur_j = DT[movie == movie_j, user_ID]
    length(intersect(ur_i, ur_j))
}

# number of users who watched at least one of the two movies
function_2 <- function(movie_i, movie_j){
    ur_i = DT[movie == movie_i, user_ID]
    ur_j = DT[movie == movie_j, user_ID]
    length(union(ur_i, ur_j))
}

cl <- makeCluster(detectCores() - 1)

clusterExport(cl=cl, varlist=c("DT", "function_1", "function_2"))

clusterCall(cl, function() library(data.table))

# clusterMap returns a list, so flatten it to get a plain integer column
DT_comb$Watched_Both <- unlist(clusterMap(cl,
            function_1,
            DT_comb$movie1,
            DT_comb$movie2), use.names = FALSE)

DT_comb$Watched_One <- unlist(clusterMap(cl,
            function_2,
            DT_comb$movie1,
            DT_comb$movie2), use.names = FALSE)

stopCluster(cl)

Maybe your solutions, once parallelized, would be even faster than mine? :-)
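
A different technique from the ones posted above, offered only as a sketch: instead of calling `intersect` once per pair, a self-join of the table on `User` produces every co-watched pair in one pass, and `WatchedOne` then follows from the per-movie viewer counts. The helper names `movies`, `co` and `res` are made up for this illustration, and no timing claim is made. Note that the self-join grows with the square of the number of movies each user watched, and the full pair table grows with the square of the number of movies.

```r
library(data.table)

# Same toy data as in the question; unique() guards against duplicate rows
DT <- unique(data.table(
  User  = c("Alice", "Alice", "Bob", "Charlotte"),
  Movie = c("Fight Club", "The Godfather", "Titanic", "The Godfather")))

# viewers per movie, plus an integer id used to order the pairs
movies <- DT[, .(N = .N), by = Movie]
movies[, id := .I]
DT[movies, on = "Movie", id := i.id]

# self-join on User: each row is a pair of movies seen by the same user;
# id.x < id.y keeps every unordered pair exactly once
co <- merge(DT, DT, by = "User", allow.cartesian = TRUE)[
  id.x < id.y, .(WatchedBoth = .N), by = .(id1 = id.x, id2 = id.y)]

# all pairs of movies, defaulting to zero shared viewers
res <- CJ(id1 = movies$id, id2 = movies$id)[id1 < id2]
res[, WatchedBoth := 0L]
res[co, on = .(id1, id2), WatchedBoth := i.WatchedBoth]

# inclusion-exclusion for "watched at least one"
res[, WatchedOne := movies$N[id1] + movies$N[id2] - WatchedBoth]
res[, `:=`(Movie1 = movies$Movie[id1], Movie2 = movies$Movie[id2])]
res[, .(Movie1, Movie2, WatchedOne, WatchedBoth)]
```

If only pairs with at least one shared viewer are needed, the `CJ` step can be skipped and `co` used directly, which avoids materializing every pair.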