How do I create a matrix in a different format from a data frame in R?

I have the data.frame below:

group_id user_id 
1000  26 
1236  29 
1236  46 
3767  26 
3767  46 
5614  29 
5614  45 
5614  46

I need the output as below:

User-1 User-2 #of-common-groups 
26  26  2 
26  46  1 
29  29  2 
29  45  1 
29  46  2 
45  29  1 
45  45  1 
45  46  1 
46  26  1 
46  29  2  
46  45  1 
46  46  3

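Is there a fast way to achieve this? I actually have 137 different groups and about 81,000 users.

User 26 belongs to 2 groups, and he also shares one group, 3767, with user 46. Hence: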

26 26 2 
26 46 1 
46 26 1 
46 46 3 (user 46 belongs to 3 groups) etc
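
To make the counting rule concrete, here is a minimal base-R sketch that counts the shared groups for one pair of users at a time. The helper name `common_groups` is hypothetical, and the function is only meant for checking small cases by hand, not for scaling to 81,000 users.

```r
# Sample data from the question
dat <- data.frame(
  group_id = c(1000, 1236, 1236, 3767, 3767, 5614, 5614, 5614),
  user_id  = c(26, 29, 46, 26, 46, 29, 45, 46)
)

# Hypothetical helper: number of groups that users u1 and u2 both belong to
common_groups <- function(u1, u2) {
  g1 <- dat$group_id[dat$user_id == u1]
  g2 <- dat$group_id[dat$user_id == u2]
  length(intersect(g1, g2))
}

common_groups(26, 26)  # 2: user 26 is in groups 1000 and 3767
common_groups(26, 46)  # 1: only group 3767 is shared
common_groups(46, 46)  # 3: user 46 is in three groups
```

A pair of a user with itself simply returns that user's own group count, which is why the diagonal entries in the desired output are 2, 2, 1 and 3.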

2015-01-03 Viswa

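How many rows does your real data have? I suspect my answer won't be very efficient if it is large, or crossprod won't work. – user20650

I actually have 127,838 rows of data. I haven't used "graphs" until now; it is time to learn. – Viswa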

Here is an attempt using the Matrix package, simply copied from @nograpes' answer:

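library(Matrix) 
# rows = group ids, columns = user ids; TRUE where the user is in the group 
sm = sparseMatrix(dat$group_id, dat$user_id, x = TRUE) 
# entry (i, j) of the product counts the groups shared by users i and j 
cp = t(sm) %*% sm 
as.data.frame(summary(cp)) 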
#  i j x 
# 1 26 26 2 
# 2 46 26 1 
# 3 29 29 2 
# 4 45 29 1 
# 5 46 29 2 
# 6 29 45 1 
# 7 45 45 1 
# 8 46 45 1 
# 9 26 46 1 
# 10 29 46 2 
# 11 45 46 1 
# 12 46 46 3
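
One variation on the approach above, offered as a sketch rather than as part of the original answer: the ids are used directly as row and column positions, so the matrix gets as many rows as the largest group id. Recoding the ids with `factor()` keeps the matrix at groups x users. The names `inc`, `co` and `res` are illustrative, and the `unique()` call assumes that each (group, user) pair should count only once.

```r
library(Matrix)

dat <- data.frame(
  group_id = c(1000, 1236, 1236, 3767, 3767, 5614, 5614, 5614),
  user_id  = c(26, 29, 46, 26, 46, 29, 45, 46)
)
dat <- unique(dat)  # a repeated (group, user) row would otherwise be summed

# Recode the ids as compact 1..n positions (assumption: ids are only labels)
g <- factor(dat$group_id)
u <- factor(dat$user_id)

# Sparse incidence matrix: rows are groups, columns are users
inc <- sparseMatrix(
  i = as.integer(g), j = as.integer(u), x = 1,
  dims = c(nlevels(g), nlevels(u)),
  dimnames = list(levels(g), levels(u))
)

# users x users matrix of shared-group counts
co <- t(inc) %*% inc

# Long format, with positions mapped back to the original user ids
res <- as.data.frame(summary(co))
res$i <- levels(u)[res$i]
res$j <- levels(u)[res$j]
names(res) <- c("user1", "user2", "common_groups")
res
```

The result stays sparse throughout, so only user pairs that actually share a group take up memory.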

2015-01-04 22:25:22 Arun

This is fast and gives me what I wanted. Thanks, Arun. – Viswa

# your data 
dat <- read.table(text="group_id user_id 
1000  26 
1236  29 
1236  46 
3767  26 
3767  46 
5614  29 
5614  45 
5614  46", header=TRUE) 

# convert to a group-by-user incidence matrix 
m <- as.matrix(table(dat)) 

# calculate and reshape 
mm <- crossprod(m,m) 
r <- reshape2::melt(mm) 

# remove where zero counts 
r[r$value != 0, ] 
# user_id user_id value 
# 1  26  26  2 
# 4  46  26  1 
# 6  29  29  2 
# 7  45  29  1 
# 8  46  29  2 
# 10  29  45  1 
# 11  45  45  1 
# 12  46  45  1 
# 13  26  46  1 
# 14  29  46  2 
# 15  45  46  1 
# 16  46  46  3

Edit: the idea comes from "Network: Making Graph Object from Event-Node Data Using igraph"

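library(igraph) 

# bipartite graph: one vertex per group and per user, one edge per membership 
g <- graph.data.frame(dat, directed = FALSE) 

# type is TRUE for group vertices and FALSE for user vertices 
V(g)$type <- V(g)$name %in% unique(as.character(dat$group_id)) 

# project onto the user vertices; edge weight = number of shared groups 
b <- bipartite.projection(g)$proj1 

ad <- get.adjacency(b, sparse=FALSE, attr="weight") 
ad <- ad[sort(colnames(ad)), sort(colnames(ad))] 

# the projection has no self-loops, so fill the diagonal with each user's group count 
diag(ad) <- colSums(table(dat)) 

# then continue as before (melt and drop the zero counts)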

2015-01-03 21:57:44 user20650

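This is a nice solution; however, with 127,000 rows crossprod runs for a very long time and my RStudio session freezes on a Mac with 16 GB of RAM. Is there a more efficient way to do the crossprod? – Viswa

Ha, I thought that might be a problem; see my comment above. This may call for a "graph" approach. – user20650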
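
A rough back-of-the-envelope check suggests why the dense crossprod freezes. This is an estimate only, assuming about 81,000 distinct users (the figure from the question) and 8 bytes per entry, since `crossprod()` returns a double matrix.

```r
n_users <- 81000        # approximate user count from the question
bytes_per_double <- 8   # crossprod() returns doubles

# A dense users x users result needs n_users^2 entries
dense_bytes <- n_users^2 * bytes_per_double
dense_gib <- dense_bytes / 1024^3
round(dense_gib, 1)     # roughly 49 GiB, before melt() makes a further copy
```

That is about three times the 16 GB of RAM mentioned above, which is why a sparse representation is attractive here.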

How about:

df <- read.table(text="group_id user_id 
1000  26 
1236  29 
1236  46 
3767  26 
3767  46 
5614  29 
5614  45 
5614  46", header=T) 

df <- merge(df, df, by = "group_id")[,-1] 
library(plyr) 
ddply(df,.(user_id.x, user_id.y),nrow) 

    user_id.x user_id.y V1 
1   26  26 2 
2   26  46 1 
3   29  29 2 
4   29  45 1 
5   29  46 2 
6   45  29 1 
7   45  45 1 
8   45  46 1 
9   46  26 1 
10  46  29 2 
11  46  45 1 
12  46  46 3

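Edit: I fear this is too simplistic, because the merge will be massive with your numbers of users and groups. Depending on the end use, I would consider a graph structure, as user20650 already suggested, and possibly keep the data in that form. In many cases an undirected weighted graph with fast lookup of vertices (user.id) seems like a good solution.

I will leave this simple approach here for smaller datasets (or just ones with less overlap).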
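
To see in advance how large the self-merge will be, note that a group with n members contributes n^2 rows to the join. The sketch below, using the illustrative names `memberships` and `join_rows`, computes that total without performing the merge.

```r
# Sample data from the question, under a fresh name
memberships <- data.frame(
  group_id = c(1000, 1236, 1236, 3767, 3767, 5614, 5614, 5614),
  user_id  = c(26, 29, 46, 26, 46, 29, 45, 46)
)

# Members per group; as.numeric() avoids integer overflow on big data
group_sizes <- as.numeric(table(memberships$group_id))

# Rows the self-join on group_id would produce: sum of squared group sizes
join_rows <- sum(group_sizes^2)
join_rows  # 18 for the sample data (1 + 4 + 4 + 9)
```

With only 137 groups spread over roughly 128,000 rows, the groups are large on average, so this sum grows very quickly.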

2015-01-03 23:33:21

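This is a nice solution! (+1) I would consider adding the same approach using dplyr or data.table for better performance. For example, in dplyr: `df %>% left_join(., ., by = "group_id") %>% select(-group_id) %>% count(user_id.x, user_id.y)`

JR - that works too. I get the error "Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x), : Join results in 560022260 rows; more than 127828 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation...". I then added allow.cartesian = TRUE, but that does not return anything. – Viswa

Docendo - what does %>% do? – Viswa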
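
For readers with the same question: `%>%` is the pipe operator from the magrittr package, which dplyr re-exports. It passes the value on its left as the first argument of the call on its right, and a `.` can be used to place that value explicitly. A small sketch, with illustrative variable names:

```r
library(magrittr)

x <- c(1, 4, 9)

# x %>% f() is the same as f(x), so pipes read left to right
piped  <- x %>% sqrt() %>% sum()
nested <- sum(sqrt(x))
identical(piped, nested)  # TRUE

# The dot stands for the left-hand side, which is how a data frame
# can be joined to itself in a single piped expression
pasted <- "a" %>% paste(., ., sep = "-")
pasted  # "a-a"
```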

So here are two approaches, one using data.table...

library(data.table) 
setkey(setDT(df),group_id) 
df[df,allow.cartesian=TRUE][,.N,by=list(user_id,i.user_id)][order(user_id,i.user_id)] 
#  user_id i.user_id N 
# 1:  26  26 2 
# 2:  26  46 1 
# 3:  29  29 2 
# 4:  29  45 1 
# 5:  29  46 2 
# 6:  45  29 1 
# 7:  45  45 1 
# 8:  45  46 1 
# 9:  46  26 1 
# 10:  46  29 2 
# 11:  46  45 1 
# 12:  46  46 3

and one using sqldf...

library(sqldf) 
sqldf("select a.user_id as user1, b.user_id as user2, count(*) as groups 
     from df a inner join df b on a.group_id=b.group_id 
     group by 1,2 order by 1,2") 
# user1 user2 groups 
# 1  26 26  2 
# 2  26 46  1 
# 3  29 29  2 
# 4  29 45  1 
# 5  29 46  2 
# 6  45 29  1 
# 7  45 45  1 
# 8  45 46  1 
# 9  46 26  1 
# 10 46 29  2 
# 11 46 45  1 
# 12 46 46  3

The data.table approach will probably be faster, but your dataset is not very large, so it may not make much difference.

2015-01-04 00:12:09 jlhoward

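Your data.table solution works, but it takes a very long time on my 127,828-row dataset. – Viswa

I think [this answer](http://stackoverflow.com/a/26246588/559784) might perform better (for data.table), but the Matrix package seems best suited for this job. – Arun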
