Combining multicore with a snow cluster

Parallel R is fairly new to me, so a quick question. I have a computationally intensive algorithm. Fortunately, it can easily be broken up into pieces to make use of multicore or snow. What I would like to know is whether, in practice, it is considered fine to use multicore in combination with snow.

What I would like to do is split my load so that it runs on multiple machines in the cluster, and on each of those machines I want to use all available cores. Is it reasonable to mix snow with multicore for this type of processing?


(You can use the 'parallel' package that is included in R >= 2.14.0; it is based on 'multicore' and 'snow'.) I would think of something like 'cl <- makeCluster(...); clusterEvalQ(cl, {library(parallel); cl <- makeCluster(4); parallel_custom_function <- function(cl, ...) {...}})', but I would be very curious to see some working code! – lockedoff 2012-07-24 18:48:52
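A minimal sketch of the nested idea from this comment, using only calls from the parallel package (the hostnames, expensive_fun, and the toy data are illustrative placeholders, not from the original post):

library(parallel)
# Stand-ins for illustration: a toy function and the work split into chunks
expensive_fun <- function(x) x^2
chunks <- split(1:100, rep(1:2, length.out = 100))
# Outer cluster: one PSOCK worker per machine (hostnames are placeholders)
outer <- makeCluster(c("machine1", "machine2"))
clusterExport(outer, "expensive_fun")
# Each machine fans its chunk out over its local cores; mclapply forks,
# so the inner level works on Linux/macOS workers but not on Windows
results <- parLapply(outer, chunks, function(chunk)
    parallel::mclapply(chunk, expensive_fun, mc.cores = parallel::detectCores()))
stopCluster(outer)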


Segue is a perfect example of this; there is absolutely no reason you can't use multicore in combination with segue. The key is to spread your load out and to understand where you are incurring overhead. – Dave 2012-08-04 18:12:17


Warning in install.packages: package 'parallel' is not available (for R version 3.0.1) – 2013-07-18 14:24:19

Answer


I have used the approach proposed by lockedoff above, i.e. the parallel package is used to distribute an embarrassingly parallel workload over multiple machines that each have multiple cores. The workload is first distributed over all machines, and then each machine's share is distributed over all of its cores. A disadvantage of this approach is that there is no load balancing between machines (at least, I don't know how to do it).

All loaded R code should be identical and located in the same place on every machine (e.g. kept in sync via svn). Because initializing a cluster takes quite some time, the code below can also be improved by reusing a cluster once it has been created.

foo <- function(workload, otherArgumentsForFoo) { 
    source("/home/user/workspace/mycode.R") 
    ... 
} 

distributedFooOnCores <- function(workload) { 
    # Somehow assign a batch number to every record (one scheme is sketched below the function)
    workload$ParBatchNumber = NA 
    # Split the assigned workload into batches according to ParBatchNumber
    batches = by(workload, workload$ParBatchNumber, function(x) x) 

    # Create a cluster with one worker per core on the local machine
    library("parallel") 
    cluster = makeCluster(detectCores(), outfile="distributedFooOnCores.log") 
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo) 
    stopCluster(cluster) 

    # Merge the resulting batches 
    results = someEmptyDataframe 
    p = 1
    for(i in 1:length(batches)){ 
     results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]] 
     p = p + nrow(batches[[i]])  
    } 

    # Clean up 
    workload$ParBatchNumber = NULL 
    return(invisible(results)) 
} 
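One way to fill in the "somehow assign a batch number" placeholder above is a simple round-robin over the number of cores, so that by() yields one batch per worker (a sketch, not part of the original answer):

    # Sketch: round-robin assignment gives every core a roughly equal share
    workload$ParBatchNumber = seq_len(nrow(workload)) %% parallel::detectCores()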

distributedFooOnMachines <- function(workload) { 
    # Somehow assign a batch number to every record 
    workload$DistrBatchNumber = NA 
    # Split the assigned workload into batches according to DistrBatchNumber
    batches = by(workload, workload$DistrBatchNumber, function(x) x) 

    # Create a cluster with workers on all machines 
    library("parallel") 
    # If makeCluster hangs, please make sure passwordless ssh is configured on all machines 
    cluster = makeCluster(c("machine1", "etc"), master="ub2", user="", outfile="distributedFooOnMachines.log") 
    batches = parLapply(cluster, batches, foo, otherArgumentsForFoo) 
    stopCluster(cluster) 

    # Merge the resulting batches 
    results = someEmptyDataframe 
    p = 1
    for(i in 1:length(batches)){ 
     results[p:(p + nrow(batches[[i]]) - 1), ] = batches[[i]] 
     p = p + nrow(batches[[i]])  
    } 

    # Clean up 
    workload$DistrBatchNumber = NULL 
    return(invisible(results)) 
} 
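Hypothetical top-level usage, assuming mycode.R defines foo such that it calls distributedFooOnCores on each machine, and myWorkload is a placeholder data frame:

source("/home/user/workspace/mycode.R")
# Outer level splits the work over machines; on every machine, foo can in
# turn call distributedFooOnCores to spread its batch over the local cores
results = distributedFooOnMachines(myWorkload)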

I would be very interested to hear how the approach above can be improved.
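For what it is worth, two changes that seem natural (a sketch, untested, under the same placeholder hostnames as above): clusterApplyLB from the parallel package hands batches to workers one at a time, which gives some load balancing between machines, and the manual merge loop can be replaced by a single rbind:

library("parallel")
# clusterApplyLB assigns the next batch to whichever worker finishes first,
# so faster machines automatically receive more batches
cluster = makeCluster(c("machine1", "etc"), outfile = "distributedFoo.log")
batches = clusterApplyLB(cluster, batches, foo, otherArgumentsForFoo)
stopCluster(cluster)
# rbind all result batches in one step instead of the index-bookkeeping loop
results = do.call(rbind, batches)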