R中的並行處理不使用所有內核

我試圖並行化在矩陣行上運行的進程。我預計，對於該行的每個元素（它是一個物種），它會提取並寫入與每個物種在其棲息地上的分佈相對應的文件（柵格）。R中的並行處理不使用所有內核

Habitas圖層是一個柵格文件，每個物種分佈是一個shapefile中的多邊形（或多邊形組）。我首先將物種多邊形轉換爲柵格，然後提取物種的棲息地（存儲在物種棲息地代碼與棲息地柵格值匹配的矩陣中），最後相交（乘以）分佈和棲息地。

此外，我想產生一個豐富度（物種地圖數量）文件（光柵）。然後，我將（總和）添加到每個最終物種分佈的空柵格（值爲零）。我寫了如下函數：

extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover) 
{ 
    #Libraries 
    library(rgdal) 
    library(raster) 
    #raster file with zeros 
    richness_cur=raster("richness_current.tif") 
    #Selection of species polygons 
    rows=as.numeric(which(as.character([email protected]$binomial)== 
          as.character(spp_data$binomial[k]))) 
    spp_poly=spp_polygons[rows,] 
    #Covert polygon(s) to raster 
    spp_poly=rasterize(spp_poly,cover,1,background=0) 
    #Match species habitats codes with habitats raster values 
    habs=as.character(spp_data$hab_code[k]) 
    habs=unlist(strsplit(habs, split=sep))#habitat codes are separeted by a ";" 
    cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab 
             [which(as.character(habitat_codes[,1])%in%habs)])) 
    #Intersect species distributions with habitats 
    cov_mask=spp_poly*cover 
    #Extract species habitats 
    cov_mask=Which(cov_mask%in%cov_classes) 
    writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep="")) 
    #Sum species richness 
    richness_cur=richness_cur+cov_mask 
    return (richness_cur) 
}

我嘗試使用clusterApply和foreach函數並行化進程。然而，我無法從這個函數中返回一個柵格對象（這是顯而易見的東西來獲得一個正則循環函數）在這兩個函數中的任何一個函數中添加物體豐富度的總和。所以，這是我的第一個問題。 1.有人知道如何在並行化過程中返回與列表，矩陣或向量不同的對象嗎？

我解決了在每次「迭代」中編寫豐富文件的問題。不過，這個選項會導致這個過程變慢，所以對我而言，這並不理想。然後，該函數被改寫如下：

extract_habitats=function(k,spp_data,spp_polygons,sep,habitat_codes,cover) 
{ 
    #Libraries 
    library(rgdal) 
    library(raster) 
    #raster file with zeros 
    richness_cur=raster("richness_current.tif") 
    #Selection of species polygons 
    rows=as.numeric(which(as.character([email protected]$binomial)== 
          as.character(spp_data$binomial[k]))) 
    spp_poly=spp_polygons[rows,] 
    #Covert polygon(s) to raster 
    spp_poly=rasterize(spp_poly,cover,1,background=0) 
    #Match species habitats codes with habitats raster values 
    habs=as.character(spp_data$hab_code[k]) 
    habs=unlist(strsplit(habs, split=sep))#habitat codes are separeted by a ";" 
    cov_classes=as.numeric(as.character(habitat_codes[,2]#Get the hab 
             [which(as.character(habitat_codes[,1])%in%habs)])) 
    #Intersect species distributions with habitats 
    cov_mask=spp_poly*cover 
    #Extract species habitats 
    cov_mask=Which(cov_mask%in%cov_classes) 
    writeRaster(cov_mask,paste(spp_data$binomial[k]," current.tif",sep="")) 
    #Sum species richness 
    richness_cur=richness_cur+cov_mask 
    writeRaster(richness_cur,"richness_current.tif") 
}

完整代碼運行並行化是：

#Number of cores 
no_cores=detectCores()-1 
#Initiate cluster 
cl=makeCluster(no_cores,type="PSOCK") 
registerDoParallel(cl) 

#Table with name and habitat information (columns) for each species (rows) 
spp_data=read.xlsx2("species_file.xls",sheetIndex=1) 
#Shape file with species distributions as polygons 
spp_polygons=readOGR("distributions.shp") 
#Separation symbol for species habitats stored in spp_data 
sep=";" 
#Tabla joining habitas species codes with habitats raster 
habitat_codes=read.xlsx2("spp_habitats_final.xls",sheetIndex=1) 
#Habitats raster 
cover=raster("Z:/Data/cover_2015_proj_fixed_reclas_1km.tif") 

#Paralelization 
foreach(k=1:nrow(spp_data)) %dopar% extract_habitats(k=k, 
                spp_data=spp_data, 
                spp_polygons=spp_polygons,sep=sep, 
                habitat_codes=habitat_codes, 
                cover=cover) 
stopImplicitCluster() 
stopCluster(cl)

的平行化處理運行;然而，它並沒有如我所料，因爲它沒有使用所有的核心：Image of processors working。所以，並行化過程的作用是啓動39（核心數量）進程：Image of processes opened，但它不會逐個寫入文件，我可以期望在常規循環中。它突然寫了39個文件（我能理解的東西）的塊，但需要花費很多時間（因爲它似乎在少量內核中工作），甚至比運行常規循環更多（運行常規循環，每個文件都被寫入每兩或三分鐘，而39個文件的塊大約每一小時寫入一次）。

所以，這是我的第二組問題。 2.我做得不好？ 3.爲什麼它沒有使用所有39個處理器，或者它使用它們，爲什麼它沒有在最高級別使用它們？ 4.爲什麼它在完成一個任務時不開始另一個任務（我想這是因爲它總是以39個塊的形式寫入文件）？

在此先感謝您的幫助。

乾杯，

海梅

來源

2017-09-09 Jaime

如果沒有數據來重現您的示例，這將非常困難。 –

有誰知道如何在一個並行的過程返回從列表，矩陣，向量或不同的對象？

對於你的第一個問題，它沒有任何意義。你想要返回什麼樣的對象？一個列表可以包含任何R對象。

爲什麼它沒有使用所有39個處理器，或者它使用它們，爲什麼它不會在最大級別使用它們？

有很多潛在的原因。從查看你的代碼，一個原因可能是磁盤IO限制，因爲你正在寫大量的圖像到磁盤。另一個潛在的原因是內存大小限制。

我在做什麼壞事？

如果您正在使用Linux（或任何非Windows），您應該使用基準R並行包中的mclapply函數。

來源

2017-09-09 19:53:38 thc

R中的並行處理不使用所有內核

回答

相關問題