2016-05-22 219 views
1

In R, I want to build a count matrix of the elements found in a large number of data files: performance problem

rnames <- c("N","A") 
mymatrix <- matrix(nrow=2,ncol=0,dimnames=list(rnames)) 

#loop through hundreds of large files (MB) 
#make the vector "names" contain all elements within each file 
for(name in names)
{
    #if name is already in the matrix increment by 1 the second row
    if(name %in% colnames(mymatrix))
    {
        mymatrix[2,name] = mymatrix[2,name]+1
    }
    #else add a column to the matrix with the specified name
    else
    {
        mymatrix <- transform(mymatrix,name)
        mymatrix[2,name] = 1
    }
}

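I ran Rprof and found that the match() function, which is presumably called inside the %in% operator, may be one of the causes of the performance problem (longer execution time).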

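Is there a more efficient way to check each element of the vector so that, if it already exists in my matrix, its count is incremented, and if it does not, a new column is added to the matrix with that element as the column name?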

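Here is a reproducible example if you want one. Keep in mind that in my original code the names vector is read from large files containing thousands of variables, and matching those against the ever-growing number of columns in mymatrix is what ultimately increases the running time: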

rnames <- c("N","A") 
mymatrix <- matrix(nrow=2,ncol=0,dimnames=list(rnames)) 

#suppose this is what the first file contains 
names <- c("x","y","z","x","x","y","a") 

#suppose this is what the second file contains
#(appended to the first file's elements so that both files are counted)
names <- c(names, "x","y","z","x","x","x","x","k")

for(name in names)
{
    if(name %in% colnames(mymatrix))
    {
        mymatrix[2,name] = mymatrix[2,name] + 1
    }
    else
    {
        mymatrix <- transform(mymatrix,name)
        mymatrix[2,name] = 1
    }
}



the expected output
> mymatrix
   x  y  z  a  k
N NA NA NA NA NA
A  8  3  2  1  1
+0

I don't see 'names' in the code. Should that be 'rnames'? – Gopala

+0

No, names is a separate vector from rnames. I noted in the code comments how that vector gets filled, but if you like I can provide the source code: mydataframe <- readRDS(file); names <- colnames(mydataframe) – Imlerith

+0

Could you post a reproducible example with minimal input and the expected output? It is hard to help if we cannot run the code. – Gopala

Answers

1

I'm not sure how you determined that match is the bottleneck. It might be, but the example you provided doesn't show that.

rnames <- c("N","A") 
mymatrix <- matrix(nrow=2, ncol=0, dimnames=list(rnames)) 
set.seed(21) 
names <- sample(letters, 1e6, TRUE) 
Rprof() 
for(name in names) {
    if(name %in% colnames(mymatrix)) {
        mymatrix[2,name] <- mymatrix[2,name] + 1
    } else {
        mymatrix <- transform(mymatrix,name)
        mymatrix[2,name] <- 1
    }
}
Rprof(NULL) 

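The results below show that the bottleneck is the data.frame methods, which are called because of your use of transform. transform.default converts your matrix to a data.frame and then calls transform.data.frame, which includes a call to match.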

R> lapply(summaryRprof(), head)
$by.self
                 self.time self.pct total.time total.pct
"[<-.data.frame"     12.02    26.15      25.90     56.35
"[.data.frame"        7.22    15.71      13.32     28.98
"match"               7.20    15.67      11.40     24.80
"%in%"                2.38     5.18      12.34     26.85
"anyDuplicated"       2.22     4.83       3.08      6.70
"names"               2.16     4.70       2.16      4.70

$by.total
                 total.time total.pct self.time self.pct
"[<-"                 27.06     58.88      1.16     2.52
"[<-.data.frame"      25.90     56.35     12.02    26.15
"["                   14.32     31.16      1.00     2.18
"[.data.frame"        13.32     28.98      7.22    15.71
"%in%"                12.34     26.85      2.38     5.18
"match"               11.40     24.80      7.20    15.67

$sample.interval
[1] 0.02

$sampling.time
[1] 45.96
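
To see the conversion described above in isolation, here is a minimal sketch. The names m, m_t and name are only illustrative, and the behaviour shown is what the profile above implies for an empty 2 x 0 matrix rather than a general statement about transform.

```r
# Illustrative 2 x 0 matrix, built the same way as in the question
m <- matrix(nrow = 2, ncol = 0, dimnames = list(c("N", "A")))
name <- "x"

# An unnamed argument adds no column, but the result is no longer a matrix
m_t <- transform(m, name)
is.matrix(m)         # TRUE
is.data.frame(m_t)   # TRUE

# This assignment now dispatches to `[<-.data.frame`, which creates column "x"
m_t[2, name] <- 1
```

Every later read and write in the loop then goes through the data.frame methods, which is where the profile shows most of the time being spent.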

Avoid calling transform and your code will be significantly faster. Note that mymatrix2 is actually a matrix, whereas mymatrix is a data.frame.

rnames <- c("N","A") 
mymatrix2 <- matrix(nrow=2, ncol=0, dimnames=list(rnames)) 
set.seed(21) 
names <- sample(letters, 1e6, TRUE) 
Rprof() 
for(name in names) {
    if(name %in% colnames(mymatrix2)) {
        mymatrix2[2,name] <- mymatrix2[2,name] + 1
    } else {
        mymatrix2 <- cbind(mymatrix2, matrix(c(NA,1), 2, 1, dimnames=list(rnames, name)))
    }
}
Rprof(NULL) 

R> lapply(summaryRprof(), head)
$by.self
                self.time self.pct total.time total.pct
"match"              1.28    41.83       2.70     88.24
"colnames"           0.78    25.49       1.42     46.41
"is.data.frame"      0.58    18.95       0.58     18.95
"%in%"               0.34    11.11       3.04     99.35
"dimnames"           0.06     1.96       0.06      1.96
"+"                  0.02     0.65       0.02      0.65

$by.total
                total.time total.pct self.time self.pct
"%in%"                3.04     99.35      0.34    11.11
"match"               2.70     88.24      1.28    41.83
"colnames"            1.42     46.41      0.78    25.49
"is.data.frame"       0.58     18.95      0.58    18.95
"dimnames"            0.06      1.96      0.06     1.96
"+"                   0.02      0.65      0.02     0.65

R> identical(mymatrix2, as.matrix(mymatrix))
[1] TRUE
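
If the loop itself is still too slow, one possible alternative (not part of the profiling above, and only a sketch) is to skip the per-element lookup and let table() do the counting in one pass. The names file_contents, all_names, counts and mymatrix3 are hypothetical, and the two vectors stand in for the file contents from the question's example.

```r
rnames <- c("N", "A")

# Hypothetical stand-ins for the elements read from two files
file_contents <- list(
  c("x", "y", "z", "x", "x", "y", "a"),
  c("x", "y", "z", "x", "x", "x", "x", "k")
)

all_names <- unlist(file_contents)

# Explicit factor levels keep the columns in order of first appearance
counts <- table(factor(all_names, levels = unique(all_names)))

# Row "N" is left as NA, row "A" holds the counts
mymatrix3 <- rbind(NA, as.vector(counts))
dimnames(mymatrix3) <- list(rnames, names(counts))
mymatrix3
```

This assumes all names can be held in memory at once. With hundreds of large files it may be preferable to tabulate each file separately and add the per-file counts together.
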
+1

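Thanks for the detailed input. I noticed that using cbind instead of transform to process the files in just one folder brought the running time down from 7.46 s to 0.92 s. Next I will try it on the whole dataset – Imlerith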