我已經使用ddply和gDistance(package:rgeos)來鏈接兩組多邊形。但是,對於我的大型數據集,該過程非常緩慢,正如許多博客中所解釋的(例如,Why is plyr so slow?)。如何在data.table中調用gDistance?
這些博客表明,使用data.table會更快,但我不能解決如何使它適用於我的情況。特別是,如何在data.table內爲我的棲息地清除數據的一部分應用gDistance?
下面我會更詳細地解釋我正在做什麼,並附上一小部分測試代碼(這裏是:https://www.dropbox.com/sh/9onaeltb81qrd7h/AAAdED0KS6n2EzP74F3o81Lxa?dl=0)。
我有兩個包含多邊形的形狀文件(其中一個爲生境清理補丁,其他屬性爲清理棲息地而受到懲罰),我想在第一個層(棲息地清理)中標識多邊形中最接近每個多邊形的多邊形第二層(受到懲罰的屬性),因爲兩者並不總是完全重疊。另外,還有一個額外的限制;匹配的棲息地清除多邊形不能比環境犯罪大5歲以上。
我已經成功地使用了以下ddply代碼 - 關於如何使用data.table做到這一點,我認爲它會更快... ...?
謝謝!
# # # # # # # # # # # # # # # # # # # Load [R] GIS packages. # # # # # # # # # # # # # # # # # #
library(rgeos)
library(raster)
library(rgdal)
library(plyr)
# # # # # # # # # # # # # # # # # # # ENTER shapefile information HERE # # # # # # # # # # # # # # # # # # # # # #
# What is the name of the punishments shapefile?
punishments <- "punishments_stack_overflow"
# What is the name of the PUNISHMENTs directory?
myDrctry <- "E:/Esri_arcGIS_datasets/SM_data/IBAMA_embargo/final_embargo_list/near_chopped/stack_overflow" # !CHANGE ME - where the shapefiles are stored
# What is the name of the hab_cl shapefile?
hab_cl_shp <- "RO_SimU_deforestation_Amazonia_SIRGAS_near"
# What is the name of the hab_cl data directory?
my_hab_cl_Drctry <- "E:/Esri_arcGIS_datasets/SM_data/PRODES/Deforestation_per_SimU/near_analysis" #! CHANGE ME
# # # # # # # # # # # # # # # # # # # # Load the shapefiles # # # # # # # # # # # # # # # # # # # # # #
# Read in the embargo shapefile
punishments_need_near <- readOGR(dsn=myDrctry, layer=punishments)
# Identify the attributes to keep
myattributes <- c("numero_tad", "data_tad", "CD_BIOMA")
# Subset the full dataset extracting only the desired attributes
[email protected] <- [email protected][,myattributes]
# Load the deforestation data
hab_cl_patches_near <- readOGR(dsn=my_hab_cl_Drctry, layer=hab_cl_shp)
proj4string(hab_cl_patches_near) # check the projection (which is SIRGAS 2000 UTM)
#[email protected] <- [email protected][,c("year","LAPIG_ID")] # manipulate the columns to match oter dataframes
#names([email protected]) <- c("ano", "PRODES_ID")
head(hab_cl_patches_near) # check that it worked
# # # # # # # # # # # # # # # # # # # # # # # Run the loop # # # # # # # # # # # # # # # # # # # # # # #
# Use ddply to calculate nearest distance for each embargo ("numero_tad")
tmp <- ddply([email protected], .(numero_tad), function(x) { # numero_tad is a unique identifier per punishment
ID <- x$numero_tad[1]
tmp.punishments <- punishments_need_near[[email protected]$numero_tad == ID,]
tmp.patches <- hab_cl_patches_near[([email protected]$ano +5) >= [email protected][,"ano_new"] & # match the punishments with habitat clearance in last 5 years (ano_new = yr of punishment, ano = yr of habitat clearance)
[email protected]$ano <= [email protected][,"ano_new"],] # and not after the punishment itself
obj <- gDistance(tmp.punishments, tmp.patches, byid=TRUE) # calculate the distance between each punishment and patch of habitat clearance
df <- data.frame(numero_tad = ID, PRODES_ID = tmp.patches$PRODES_ID[which.min(obj)], dist = min(obj)) # link punishment with the nearest suitable patch of habitat clearance
}, .progress='text') # progress bar
head(tmp)
聽起來像是使用'.BY'的機會。有關這方面的一些示例,請參閱https://github.com/Rdatatable/data.table/issues/1363 – MichaelChirico