2014-07-03 42 views
1

考慮兩個因素(各一組相同的水平),說的R因素外產品 - 自定義映射 - 太慢

lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot") 
A <- factor(sample(lev, 6000, TRUE)) 
B <- factor(sample(lev, 6000, TRUE)) 

我想利用自己的產品外,有一個自定義的產品功能,定義像這樣:

mapping <- matrix(c(
    "green", "blue", "blue", "red", "red", "red", 
    "blue", "green", "blue", "red", "red", "red", 
    "blue", "blue", "green", "red", "red", "red", 
    "red", "red", "red", "green", "yellow", "red", 
    "red", "red", "red", "yellow", "green", "red", 
    "red", "red", "red", "red", "red", "green"), 
    nrow=6, ncol=6, 
    dimnames=list(lev, lev)) 
mapper <- function (X, Y) mapping[matrix(c(levels(X)[X], levels(Y)[Y]), 
             ncol=2, byrow=TRUE)] 
A.B <- outer(A, B, FUN=mapper) 

預期輸出(爲一個顯着減小的測試用例)應該是這樣的

> A 
[1] alpha foxtrot echo charlie echo foxtrot bravo delta charlie 
Levels: alpha bravo charlie delta echo foxtrot 
> B 
[1] alpha foxtrot delta bravo bravo alpha alpha bravo alpha 
Levels: alpha bravo delta foxtrot 
> outer(A, B, mapper) 
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] 
[1,] "red" "red" "red" "red" "red" "green" "green" "green" "green" 
[2,] "red" "red" "red" "red" "red" "green" "green" "green" "green" 
[3,] "red" "red" "red" "red" "red" "green" "green" "green" "green" 
[4,] "red" "red" "red" "red" "red" "green" "green" "green" "green" 
[5,] "blue" "blue" "blue" "blue" "blue" "red" "green" "green" "blue" 
[6,] "red" "red" "red" "red" "green" "green" "green" "green" "green" 
[7,] "red" "red" "red" "red" "green" "green" "green" "green" "green" 
[8,] "red" "red" "red" "red" "green" "green" "green" "green" "green" 
[9,] "red" "red" "red" "red" "green" "green" "green" "green" "green" 

這工作,但在滿刻度是令人不快的慢:

> system.time(outer(A, B, mapper)) 
    user system elapsed 
11.381 5.015 17.653 

誰能推薦一個更快的方法?如果它有幫助,映射矩陣保證爲三角形(即mapping[a,b] == mapping[b,a]∀a,b)。

+1

小問題:你的意思是'sample(lev,6000,TRUE)'?否則,你只是得到'NA',我想...... – joran

+1

也可以給你一些期望的輸出,以確保我們得到正確的答案? – MrFlick

+0

考慮到這一點,我的第一反應是想知道你是否必須有(三角形)矩陣輸出。這感覺更像是一個加入(或合併)給我。 – joran

回答

4

編輯:看來這個問題發生了巨大變化,而我回答,但無論如何,這裏留下這個。


我假設@ joran的評論是正確的,你的意思是(和固定在lev順序)

lev <- c("alpha", "bravo", "charlie", "delta", "echo", "foxtrot") 
A <- factor(sample(lev, 6000, TRUE), levels=lev) 
B <- factor(sample(lev, 6000, TRUE), levels=lev) 

而且,mapping不是一個二維數組(矩陣),也不是一個嵌套的數據結構(名單列表)就像你似乎認爲

> mapping 
    alpha.alpha  alpha.bravo alpha.charlie  alpha.delta  alpha.echo 
     "green"   "blue"   "blue"   "red"   "red" 
    alpha.foxtrot  bravo.alpha  bravo.bravo bravo.charlie  bravo.delta 
      "red"   "blue"   "green"   "blue"   "red" 
    bravo.echo bravo.foxtrot charlie.alpha charlie.bravo charlie.charlie 
      "red"   "red"   "blue"   "blue"   "green" 
    charlie.delta charlie.echo charlie.foxtrot  delta.alpha  delta.bravo 
      "red"   "red"   "red"   "red"   "red" 
    delta.charlie  delta.delta  delta.echo delta.foxtrot  echo.alpha 
      "red"   "green"  "yellow"   "red"   "red" 
    echo.bravo echo.charlie  echo.delta  echo.echo echo.foxtrot 
      "red"   "red"  "yellow"   "red"   "red" 
    foxtrot.alpha foxtrot.bravo foxtrot.charlie foxtrot.delta foxtrot.echo 
      "red"   "red"   "red"   "red"   "red" 
foxtrot.foxtrot 
     "green" 

現在,如果你想保存這是一個列表的列表:

mapping <- list(
    "alpha" = list("alpha"="green", "bravo"="blue", "charlie"="blue", 
        "delta"="red", "echo"="red", "foxtrot"="red"), 
    "bravo" = list("alpha"="blue", "bravo"="green", "charlie"="blue", 
        "delta"="red", "echo"="red", "foxtrot"="red"), 
    "charlie" = list("alpha"="blue", "bravo"="blue", "charlie"="green", 
        "delta"="red", "echo"="red", "foxtrot"="red"), 
    "delta" = list("alpha"="red", "bravo"="red", "charlie"="red", 
        "delta"="green", "echo"="yellow", "foxtrot"="red"), 
    "echo" = list("alpha"="red", "bravo"="red", "charlie"="red", 
        "delta"="yellow", "echo"="red", "foxtrot"="red"), 
    "foxtrot" = list("alpha"="red", "bravo"="red", "charlie"="red", 
        "delta"="red", "echo"="red", "foxtrot"="green") 
) 
mapper = function(X, Y) mapping[[levels(X)[X]]][[levels(Y)[Y]]] 

請注意,我在創造mappingmapper使用list代替c使用提取([[)未子集化([)符號。

檢查這個工程的一個值:

> mapper(A[1], B[1]) 
[1] "red" 

而對於只有幾個值:

> mapper(A[1:2], B[1:2]) 
Error in mapping[[levels(X)[X]]][[levels(Y)[Y]]] : 
    attempt to select more than one element 

所以我們看到mapper沒有矢量(因爲它必須是)。從outer的幫助頁面:

FUN以這兩個擴展向量爲參數被調用。因此,它必須是一個矢量化函數(或一個名稱),期望至少有兩個參數。

最簡單的,但不一定是有效的,方式向量化它:

> Vectorize(mapper)(A[1:2], B[1:2]) 
[1] "red" "green" 

這現在工作在一個子集:

> outer(A[1:6], B[1:6], FUN=Vectorize(mapper)) 
    [,1] [,2]  [,3] [,4] [,5] [,6]  
[1,] "red" "yellow" "red" "red" "red" "red" 
[2,] "red" "green" "red" "red" "red" "yellow" 
[3,] "red" "green" "red" "red" "red" "yellow" 
[4,] "blue" "red" "blue" "red" "blue" "red" 
[5,] "green" "red" "green" "red" "green" "red" 
[6,] "red" "red" "red" "green" "red" "red" 

讓我們來看看時序:

> system.time(outer(A[1:6], B[1:6], FUN=Vectorize(mapper))) 
    user system elapsed 
     0  0  0 
> system.time(outer(A[1:60], B[1:60], FUN=Vectorize(mapper))) 
    user system elapsed 
    0.22 0.00 0.22 
> system.time(outer(A[1:600], B[1:600], FUN=Vectorize(mapper))) 
    user system elapsed 
    23.97 0.01 24.01 

查看外部產品長度的線性,或q在A或B的長度上是不成熟的。我沒有等待40分鐘,看看6000x6000是否可以工作。

我們可以讓這個效率更高嗎?雙遞歸到遞歸結構(然後在頂部使用Vectorize)並不是那麼有效。讓我們使用不同的數據結構:一個二維數組(矩陣)並使用基於矩陣的索引。

mapping <- matrix(c("green", "blue", "blue", "red", "red", "red", 
        "blue", "green", "blue", "red", "red", "red", 
        "blue", "blue", "green", "red", "red", "red", 
        "red", "red", "red", "green", "yellow", "red", 
        "red", "red", "red", "yellow", "red", "red", 
        "red", "red", "red", "red", "red", "green"), 
        nrow = 6, ncol = 6, 
        dimnames = list(lev, lev)) 
mapper <- function(X, Y) mapping[cbind(as.character(X), as.character(Y))] 

和測試這個

> A[1:6] 
[1] echo delta delta charlie alpha foxtrot 
Levels: alpha bravo charlie echo delta foxtrot 
> B[1:6] 
[1] alpha delta alpha foxtrot alpha echo 
Levels: alpha bravo charlie echo delta foxtrot 
> mapper(A[1], B[1]) 
[1] "red" 
> mapper(A[1:2], B[1:2]) 
[1] "red" "green" 
> outer(A[1:6], B[1:6], FUN=mapper) 
    [,1] [,2]  [,3] [,4] [,5] [,6]  
[1,] "red" "yellow" "red" "red" "red" "red" 
[2,] "red" "green" "red" "red" "red" "yellow" 
[3,] "red" "green" "red" "red" "red" "yellow" 
[4,] "blue" "red" "blue" "red" "blue" "red" 
[5,] "green" "red" "green" "red" "green" "red" 
[6,] "red" "red" "red" "green" "red" "red" 

看起來不錯。檢查時間:

> system.time(outer(A[1:6], B[1:6], FUN=mapper)) 
    user system elapsed 
     0  0  0 
> system.time(outer(A[1:60], B[1:60], FUN=mapper)) 
    user system elapsed 
     0  0  0 
> system.time(outer(A[1:600], B[1:600], FUN=mapper)) 
    user system elapsed 
    0.22 0.00 0.22 
> system.time(outer(A, B, FUN=mapper)) 
    user system elapsed 
    7.80 1.48 9.30 

稍微超過9秒而不是約40分鐘約250倍的加速。

+0

感謝您解釋*爲什麼*問題的原始版本竟然不起作用。我不能說我真的很喜歡R的數據結構處理;通常我只是有一個大數據框,我知道如何使用'lm'和'ggplot' ;-) – zwol

+0

請注意,由於您使用了字符索引矩陣,因此比@MrFlick報告的要慢(請參閱我的答案;數字是方式更快)。 – BrodieG

+0

數字索引將比字符索引更快,但我認爲這是明確的意圖(並沒有感覺到需要擠出更高的效率)。爲了更有效的解決方案,識別關於將標籤映射到需要製作的索引的假設,使用數字版本將是首選。 –

2

因此,您的mapping變量不太正確。如果你看看

str(mapping) 
# Named chr [1:36] "green" "blue" "blue" "red" "red" "red" ... 
# - attr(*, "names")= chr [1:36] "alpha.alpha" "alpha.bravo" "alpha.charlie" ... 

你會發現它是一維的字符向量。這些元素的名稱用「。」粘貼在一起。我假設這不是你想要的。也許你已經使用list()而不是c()?但是,如果你能控制的格式,爲什麼不能用一個簡單的矩陣

mapping <- structure(c("green", "blue", "blue", "red", "red", "red", "blue", 
"green", "blue", "red", "red", "red", "blue", "blue", "green", 
"red", "red", "red", "red", "red", "red", "green", "yellow", 
"red", "red", "red", "red", "yellow", "red", "red", "red", "red", 
"red", "red", "red", "green"), .Dim = c(6L, 6L), .Dimnames = list(
    c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot" 
    ), c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot" 
    ))) 

所以這是一個行和列的lev每個值和單元格的顏色是組合的顏色。

然後,如果你做

#sample data 
lev <- c("alpha", "bravo", "charlie", "echo", "delta", "foxtrot") 
A <- factor(sample(lev, 6000, TRUE), levels=lev) 
B <- factor(sample(lev, 6000, TRUE), levels=lev) 

#run mapping 
out <- outer(A, B, FUN=function(a,b) mapping[cbind(a,b)]) 

現在出去將沿着cols和正確的顏色爲單元格值兩者之間的相互作用行和B的值A的值。這將運行相當迅速

system.time(outer(A, B, FUN=function(a,b) mapping[cbind(a,b)])) 

# user system elapsed 
# 0.90 0.25 1.15 
+0

嗯。我意識到我的代碼沒有像我以前所做的那樣做任何事情後,我想出了類似的東西(請參閱編輯的問題)。您的版本速度稍快,但在此計算機上仍需5秒(已過) - 也許您的速度更快? – zwol

+0

也許吧。無論哪種方式,它都沒有接近您首次報告的15分鐘以內。 – MrFlick

0

我相信這你想要做什麼,在約4秒(運行約4倍速度更快,如果你不加名字,但結果看起來並不好)。請注意,非常重要:這隻適用於AB的級別相同且級別與mapping.mx的名稱相同的級別。這是因爲cbind將因素強制爲數字,所以映射是位置的。如果不是這種情況,你可以強制AB這個字符,它可以工作,但速度會更慢。

names(A) <- A 
names(B) <- B 

mapping.mx <- do.call(rbind, mapping.lst) # see below for mapping.lst 
system.time(res <- outer(A, B, function(x, y) mapping.mx[cbind(x, y)])) 

# user system elapsed 
# 3.33 0.62 3.95 

str(res) 

# chr [1:6000, 1:6000] "red" "green" "green" "blue" "green" "blue" ... 
# - attr(*, "dimnames")=List of 2 
# ..$ : chr [1:6000] "delta" "alpha" "alpha" "bravo" ... 
# ..$ : chr [1:6000] "alpha" "alpha" "echo" "delta" ... 

res[1:5, 1:5] 

#  alpha alpha echo  delta charlie 
# delta "red" "red" "yellow" "red" "red" 
# alpha "green" "green" "red" "red" "blue" 
# alpha "green" "green" "red" "red" "blue" 
# bravo "blue" "blue" "red" "red" "blue" 
# alpha "green" "green" "red" "red" "blue"  

而且mapping.lst(基本上,你的一樣,但改變了第一clist):

mapping.lst <- list(
    "alpha" = c("alpha"="green", "bravo"="blue", "charlie"="blue", 
       "delta"="red", "echo"="red", "foxtrot"="red"), 
    "bravo" = c("alpha"="blue", "bravo"="green", "charlie"="blue", 
       "delta"="red", "echo"="red", "foxtrot"="red"), 
    "charlie" = c("alpha"="blue", "bravo"="blue", "charlie"="green", 
       "delta"="red", "echo"="red", "foxtrot"="red"), 
    "delta" = c("alpha"="red", "bravo"="red", "charlie"="red", 
       "delta"="green", "echo"="yellow", "foxtrot"="red"), 
    "echo" = c("alpha"="red", "bravo"="red", "charlie"="red", 
       "delta"="yellow", "echo"="red", "foxtrot"="red"), 
    "foxtrot" = c("alpha"="red", "bravo"="red", "charlie"="red", 
       "delta"="red", "echo"="red", "foxtrot"="green") 
)