2016-07-07 20 views
4

Counting the number of ordered pairs in a matrix in R

Given the matrix m below (each row is a permutation of 1-5):

# [,1] [,2] [,3] [,4] [,5] 
# [1,] 1 5 2 4 3 
# [2,] 2 1 4 3 5 
# [3,] 3 4 1 2 5 
# [4,] 4 1 3 2 5 
# [5,] 4 3 1 2 5 
# [6,] 1 4 2 3 5 
# [7,] 4 3 2 5 1 
# [8,] 4 1 3 5 2 
# [9,] 1 2 3 4 5 
# [10,] 4 3 2 1 5 

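I want to know how many times each element of 1-5 precedes each other element within a row, counted over all rows (i.e., considering all possible pairs).

For example, for the pair (1, 5), 1 precedes 5 in 9 of the rows. As another example, for (3, 1), 3 precedes 1 in 4 of the rows. I would like the same kind of count for every possible pair across all rows, that is: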

# (1, 2), (1, 3), (1, 4), (1, 5) 
# (2, 1), (2, 3), (2, 4), (2, 5) 
# (3, 1), (3, 2), (3, 4), (3, 5) 
# (4, 1), (4, 2), (4, 3), (4, 5) 
# (5, 1), (5, 2), (5, 3), (5, 4) 

m <- structure(c(1L, 2L, 3L, 4L, 4L, 1L, 4L, 4L, 1L, 4L, 5L, 1L, 4L, 
1L, 3L, 4L, 3L, 1L, 2L, 3L, 2L, 4L, 1L, 3L, 1L, 2L, 2L, 3L, 3L, 
2L, 4L, 3L, 2L, 2L, 2L, 3L, 5L, 5L, 4L, 1L, 3L, 5L, 5L, 5L, 5L, 
5L, 1L, 2L, 5L, 5L), .Dim = c(10L, 5L)) 

How can I do this efficiently in R?

Edit

How would you do it for the following matrix?

 # [,1] [,2] [,3] [,4] [,5] 
# [1,] 3 4 1 5 0 
# [2,] 1 2 5 3 0 
# [3,] 3 5 0 0 0 
# [4,] 4 5 0 0 0 
# [5,] 3 4 1 5 2 
# [6,] 3 1 2 0 0 
# [7,] 4 1 5 2 0 
# [8,] 4 3 5 2 0 
# [9,] 5 2 0 0 0 
# [10,] 5 4 2 0 0 

m <- structure(c(3, 1, 3, 4, 3, 3, 4, 4, 5, 5, 4, 2, 5, 5, 4, 1, 1, 
3, 2, 4, 1, 5, 0, 0, 1, 2, 5, 5, 0, 2, 5, 3, 0, 0, 5, 0, 2, 2, 
0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0), .Dim = c(10L, 5L)) 
+0

Does 'm' indeed contain permutations? I.e., does each number appear only once in each row (except 0)? And how large is your actual 'm'? –

+0

@alexis_laz Yes, each number appears only once per row. 'm' can have around 1000-10000 rows, and the number of columns can be around 2-10. – 989

+0

Also, are the 0s always clustered at the end of each row? –

Answers

1

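Knowing that (1) no row contains duplicates, (2) the 0s in each row are clustered at the end, and (3) nrow(m) is 2-3 orders of magnitude larger than ncol(m), we can loop over the columns, searching for occurrences of the two given numbers, and cut unnecessary computation by dropping each row from the search once its 0s have been reached: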

ff = function(x, a, b) 
{ 
    ia = rep_len(NA_integer_, nrow(x)) # positions of 'a' in each row 
    ib = rep_len(NA_integer_, nrow(x)) # -//- of 'b' 
    notfound0 = seq_len(nrow(x)) # rows that have not, yet, a 0 
    for(j in seq_len(ncol(x))) { 
     xj = x[notfound0, j] 
     if(!length(xj)) break 

     ia[notfound0[xj == a]] = j 
     ib[notfound0[xj == b]] = j 

     notfound0 = notfound0[xj != 0L] # check if any more rows have 0 now on 
    } 

    i = ia < ib ## is 'a' before 'b'? 

    ## return both a - b and b - a; no need to repeat computations 
    data.frame(a = c(a, b), 
       b = c(b, a), 
       n = c(sum(i, na.rm = TRUE), sum(!i, na.rm = TRUE))) 
} 

And on the edited m:

ff(m, 3, 2) 
# a b n 
#1 3 2 3 
#2 2 3 1 
ff(m, 5, 1) 
# a b n 
#1 5 1 0 
#2 1 5 4 

And for all pairs:

xtabs(n ~ a + b, 
     do.call(rbind, 
       combn(5, 2, function(x) ff(m, x[1], x[2]), 
        simplify = FALSE))) 
# b 
#a 1 2 3 4 5 
# 1 0 4 1 0 4 
# 2 0 0 1 0 1 
# 3 3 3 0 2 4 
# 4 3 4 1 0 5 
# 5 0 5 1 1 0 

It also seems to hold up acceptably at a larger scale:

set.seed(007) 
MAT = do.call(rbind, combinat::permn(8))[sample(1e4), ] 
MAT[sample(length(MAT), length(MAT)*0.4)] = 0L #40% 0s 
MAT = t(apply(MAT, 1, function(x) c(x[x != 0L], rep_len(0L, sum(x == 0L))))) 
dim(MAT) 
#[1] 10000  8 

## including colonel's answer for a quick comparison 
colonel = function(x, a, b) 
{ 
    i = (which(!t(x - b)) - which(!t(x - a))) > 0L 
    data.frame(a = c(a, b), b = c(b, a), n = c(sum(i), sum(!i))) 
} 

microbenchmark::microbenchmark(ff(MAT, 7, 2), colonel(MAT, 7, 2)) 
#Unit: milliseconds 
#    expr  min  lq  mean median  uq  max neval cld 
#  ff(MAT, 7, 2) 3.795003 3.908802 4.500453 3.972138 4.096377 45.926679 100 b 
# colonel(MAT, 7, 2) 2.156941 2.231587 2.423053 2.295794 2.404894 3.775516 100 a 
#There were 50 or more warnings (use warnings() to see the first 50) 

So a straightforward translation of the approach into a loop proves efficient enough. More 0s should reduce the computation time further.
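For the full table of pairs, one possible variation (a sketch only, not part of the approach above; `pair_counts` and `pos` are hypothetical names) is to record the column position of every value once and then compare the position columns pairwise. It assumes the values are 1..k, that no row repeats a value, and that 0 marks an empty cell:

```r
# Edited matrix from the question, entered row by row.
m <- matrix(c(3, 4, 1, 5, 0,
              1, 2, 5, 3, 0,
              3, 5, 0, 0, 0,
              4, 5, 0, 0, 0,
              3, 4, 1, 5, 2,
              3, 1, 2, 0, 0,
              4, 1, 5, 2, 0,
              4, 3, 5, 2, 0,
              5, 2, 0, 0, 0,
              5, 4, 2, 0, 0), ncol = 5, byrow = TRUE)

# Hypothetical helper: assumes values in 1..k, no duplicates per row, 0 = empty.
pair_counts <- function(x, k = max(x)) {
  # pos[i, v] = column at which value v sits in row i (NA if v is absent)
  pos <- matrix(NA_integer_, nrow(x), k)
  nz <- x != 0
  pos[cbind(row(x)[nz], x[nz])] <- col(x)[nz]

  labs <- as.character(seq_len(k))
  out <- matrix(0L, k, k, dimnames = list(a = labs, b = labs))
  for (a in seq_len(k)) {
    for (b in seq_len(k)) {
      # rows lacking either value give NA and are dropped by na.rm
      if (a != b) out[a, b] <- sum(pos[, a] < pos[, b], na.rm = TRUE)
    }
  }
  out
}

tab <- pair_counts(m)
tab
```

Rows of `tab` are the leading element `a` and columns the trailing element `b`, so it can be compared cell by cell with the `xtabs` output.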

3

First, here is how we can do it with hard-coded numbers:

apply(m, 1, function(r) { which(r == 1) < which(r == 5) }) 
# [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE 
sum(apply(m, 1, function(r) { which(r == 1) < which(r == 5) })) 
# [1] 9 

To automate this for every pair, here is a data.frame of all combinations of 1:5 (excluding pairs of identical numbers):

df <- expand.grid(a = 1:5, b = 1:5) 
df <- df[ df$a != df$b, ] 
head(df) 
# a b 
# 2 2 1 
# 3 3 1 
# 4 4 1 
# 5 5 1 
# 6 1 2 
# 8 3 2 

Now we just need to iterate over each of its rows (I suppose we could use a matrix with another apply):

df$seqs <- sapply(seq_len(nrow(df)), function(i) { 
    sum(apply(m, 1, function(r) which(r == df$a[i]) < which(r == df$b[i]))) 
}) 
head(df) 
# a b seqs 
# 2 2 1 3 
# 3 3 1 4 
# 4 4 1 6 
# 5 5 1 1 
# 6 1 2 7 
# 8 3 2 6 

Alternatively, I think this is a great time to use mapply:

myfunc <- function(a, b, m) sum(apply(m, 1, function(r) which(r == a) < which(r == b))) 
df$seqs <- mapply(myfunc, df$a, df$b, list(m)) 
head(df) 
# a b seqs 
# 2 2 1 3 
# 3 3 1 4 
# 4 4 1 6 
# 5 5 1 1 
# 6 1 2 7 
# 8 3 2 6 

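Granted, I used a named function instead of an anonymous one (which could have been done above, too), but this shows off some of the elegance of the approach.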
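If a square table is easier to read than the long data.frame, the counts can be reshaped with `xtabs`. The following is a minimal sketch on the question's first matrix; the name `tab` is introduced here only for illustration:

```r
# First matrix from the question, entered row by row.
m <- matrix(c(1, 5, 2, 4, 3,
              2, 1, 4, 3, 5,
              3, 4, 1, 2, 5,
              4, 1, 3, 2, 5,
              4, 3, 1, 2, 5,
              1, 4, 2, 3, 5,
              4, 3, 2, 5, 1,
              4, 1, 3, 5, 2,
              1, 2, 3, 4, 5,
              4, 3, 2, 1, 5), ncol = 5, byrow = TRUE)

# Count the rows in which 'a' appears before 'b'.
myfunc <- function(a, b, m) sum(apply(m, 1, function(r) which(r == a) < which(r == b)))

# All ordered pairs of distinct values.
df <- expand.grid(a = 1:5, b = 1:5)
df <- df[df$a != df$b, ]
df$seqs <- mapply(myfunc, df$a, df$b, list(m))

# Rows: leading element a; columns: trailing element b.
tab <- xtabs(seqs ~ a + b, df)
tab
```

Because every row of this matrix is a full permutation, `tab[a, b] + tab[b, a]` should equal the number of rows for each pair.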

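Edit: with the new constraints, a number may now have no match in a row of m. Because which returns integer(0) when there is no match, the comparison yields logical(0), so the code above fails: apply returns a heterogeneous list instead of a logical vector. One way to solve this is with a quick helper function: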

apply(m, 1, function(r) which(r == a) < which(r == b)) 
# [[1]] 
# logical(0) 
# [[2]] 
# [1] FALSE 
# ... 

emptyF <- function(x) sapply(x, function(y) if (! length(y)) FALSE else y) 
emptyF(apply(m, 1, function(r) which(r == a) < which(r == b))) 
# [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 

Now myfunc becomes:

myfunc <- function(a, b, m) sum(emptyF(apply(m, 1, function(r) which(r == a) < which(r == b)))) 

(Note: I like Colonel Beauvel's answer, as it is vectorized and therefore likely faster. It would also benefit from this or a similar remedy for non-matches.)
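As an alternative to the `emptyF` helper, one could sidestep the zero-length case with `match`, which returns `NA` instead of an empty vector when a value is absent. This is a sketch under that assumption, not the answer's own method; `count_before` is a hypothetical name:

```r
# Edited matrix from the question, entered row by row (0 = empty cell).
m <- matrix(c(3, 4, 1, 5, 0,
              1, 2, 5, 3, 0,
              3, 5, 0, 0, 0,
              4, 5, 0, 0, 0,
              3, 4, 1, 5, 2,
              3, 1, 2, 0, 0,
              4, 1, 5, 2, 0,
              4, 3, 5, 2, 0,
              5, 2, 0, 0, 0,
              5, 4, 2, 0, 0), ncol = 5, byrow = TRUE)

# Hypothetical helper: match() gives the position of a value in the row,
# or NA when it is missing, so each row yields exactly one TRUE/FALSE/NA.
count_before <- function(a, b, m) {
  sum(apply(m, 1, function(r) match(a, r) < match(b, r)), na.rm = TRUE)
}

n32 <- count_before(3, 2, m)  # rows where 3 precedes 2
n51 <- count_before(5, 1, m)  # rows where 5 precedes 1
```

Rows that lack either number contribute `NA` and are dropped by `na.rm = TRUE`, which has the same effect as treating them as `FALSE`.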

6

Here is a vectorized solution, without apply:

func <- function(a,b) sum((which(!t(m-b)) - which(!t(m-a)))>0) 

#> func(1,5) 
#[1] 9 
#> func(5,1) 
#[1] 1 

And to generate all the combinations you want, you can simply do:

N = combn(1:5, 2) 
cbind(N, N[nrow(N):1,]) 

Then you just need to loop over the columns and apply the function.
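That last step could look roughly like the sketch below, which uses the question's first matrix; the names `pairs` and `res` are introduced here only for illustration. Note that `func` relies on every number appearing exactly once in every row, so it is not meant for the edited matrix containing 0s:

```r
# First matrix from the question, entered row by row.
m <- matrix(c(1, 5, 2, 4, 3,
              2, 1, 4, 3, 5,
              3, 4, 1, 2, 5,
              4, 1, 3, 2, 5,
              4, 3, 1, 2, 5,
              1, 4, 2, 3, 5,
              4, 3, 2, 5, 1,
              4, 1, 3, 5, 2,
              1, 2, 3, 4, 5,
              4, 3, 2, 1, 5), ncol = 5, byrow = TRUE)

# Vectorized count of rows in which 'a' precedes 'b' (reads m from the
# enclosing environment, as in the answer).
func <- function(a, b) sum((which(!t(m - b)) - which(!t(m - a))) > 0)

# Each column of 'pairs' is one ordered pair (a on top, b below).
N <- combn(1:5, 2)
pairs <- cbind(N, N[nrow(N):1, ])

# Loop over the columns and apply the function to each pair.
res <- data.frame(a = pairs[1, ],
                  b = pairs[2, ],
                  n = apply(pairs, 2, function(p) func(p[1], p[2])))
res
```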

1

Try this:

library(plyr) 
combns <- expand.grid(unique(as.vector(m)),unique(as.vector(m))) 
combns <- combns[combns$Var1!=combns$Var2,] 
combns <- combns[with(combns,order(Var1)),] 
combns$count <- sapply(1:nrow(combns),function(u) sum(unlist(apply(apply(m,1,function(t) match(t,combns[u,])),2,function(s) na.exclude(count(unlist(sapply(seq(length(s)),function(t) diff(s,lag=t))))$freq[count(unlist(sapply(seq(length(s)),function(t) diff(s,lag=t))))$x==1]))),na.rm = T))