In R, I want to build a matrix of element counts across a large number of data files: performance issues
rnames <- c("N","A")
mymatrix <- matrix(nrow=2,ncol=0,dimnames=list(rnames))
#loop through hundreds of large files (MB)
#make the vector "names" contain all elements within each file
for(name in names)
{
  #if name is already in the matrix increment by 1 the second row
  if(name %in% colnames(mymatrix))
  {
    mymatrix[2,name] = mymatrix[2,name]+1
  }
  #else add a column to the matrix with the specified name
  else
  {
    #append one NA column whose column name is the new element
    mymatrix <- cbind(mymatrix, matrix(NA,nrow=2,ncol=1,dimnames=list(rnames,name)))
    mymatrix[2,name] = 1
  }
}
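I ran Rprof and found that the match() function, which is presumably called inside the %in% operator, is one of the causes of the performance problem (longer execution time).
Is there a more efficient way to go through each element of the vector so that, if it already exists in my matrix, its count is incremented, and if it does not, a new column is added to the matrix with that element as the column name?
Here is a reproducible example if you want one. Keep in mind, though, that in my original code the names vector is read from large files containing thousands of variables, and matching those against the ever-growing number of columns in mymatrix is what eventually drives up the run time: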
rnames <- c("N","A")
mymatrix <- matrix(nrow=2,ncol=0,dimnames=list(rnames))
files <- list(
  #suppose this is what the first file contains
  c("x","y","z","x","x","y","a"),
  #suppose this is what the second file contains
  c("x","y","z","x","x","x","x","k")
)
#process the files one after another so the counts accumulate
for(names in files)
{
  for(name in names)
  {
    if(name %in% colnames(mymatrix))
    {
      mymatrix[2,name] = mymatrix[2,name] + 1
    }
    else
    {
      #append one NA column whose column name is the new element
      mymatrix <- cbind(mymatrix, matrix(NA,nrow=2,ncol=1,dimnames=list(rnames,name)))
      mymatrix[2,name] = 1
    }
  }
}
The expected output:
> mymatrix
   x  y  z  a  k
N NA NA NA NA NA
A  8  3  2  1  1
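I don't see 'names' anywhere in the code. Is that supposed to be 'rnames'? – Gopala
No, names is a separate vector, different from rnames. I described in a comment how I fill that vector, but if you want the source code, here it is: mydataframe <- readRDS(file); names <- colnames(mydataframe) – Imlerith
Could you post a reproducible example with minimal input and the expected output? It is hard to help if we can't run the code. – Gopala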
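One possible direction, offered as a sketch rather than a definitive answer: instead of testing each name against the growing column set with %in%, collect the names from all files and count them in a single vectorised pass with table(). The `files` list below is a hypothetical stand-in for the per-file vectors (in the real code each entry would come from something like colnames(readRDS(file))), and `all_names` and `counts` are names made up for illustration.

```r
# Hypothetical stand-ins for the per-file name vectors from the question.
files <- list(
  c("x", "y", "z", "x", "x", "y", "a"),
  c("x", "y", "z", "x", "x", "x", "x", "k")
)

# Concatenate the names from every file into one character vector.
all_names <- unlist(files, use.names = FALSE)

# Count every name at once; fixing the factor levels to first-appearance
# order keeps the columns in the same order the loop would produce.
counts <- table(factor(all_names, levels = unique(all_names)))

# Rebuild the two-row layout: row "N" stays NA, row "A" holds the counts.
mymatrix <- rbind(N = NA, A = as.vector(counts))
colnames(mymatrix) <- names(counts)
mymatrix
```

This avoids both the repeated match() calls and the repeated cbind() copies of the matrix. It assumes the concatenated name vector fits comfortably in memory, which seems plausible when each file contributes only its column names.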