In R, I want to build a count matrix of the elements found in a large number of data files: performance problem

rnames <- c("N","A") 
mymatrix <- matrix(nrow=2,ncol=0,dimnames=list(rnames)) 

#loop through hundreds of large files (MB) 
#make the vector "names" contain all elements within each file 
for(name in names) 
{ 
#if name is already in the matrix increment by 1 the second row 
    if(name %in% colnames(mymatrix)) 
    { 
    mymatrix[2,name] = mymatrix[2,name]+1 
    } 
#else add a column to the matrix with the specified name 
    else 
    { 
    mymatrix <- transform(mymatrix,name) 
    mymatrix[2,name] = 1 
    }  
}

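I ran Rprof and found that the match() function, which is presumably called inside the %in% operator, is one of the causes of the performance problem (longer execution time).

Is there a more efficient way to check, for each element of the vector, whether it already exists in my matrix, increment its count if it does, and otherwise add a new column to the matrix with that element as the column name?

Here is a reproducible example if you want one. Keep in mind that in my original code the names vector is read from large files containing thousands of variables, and matching those against the ever-growing number of columns in mymatrix is what eventually drives the run time up: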

rnames <- c("N","A") 
mymatrix <- matrix(nrow=2,ncol=0,dimnames=list(rnames)) 

#suppose this is what the first file contains 
names <- c("x","y","z","x","x","y","a") 

#suppose this is what the second file contains (appended so that both files are counted) 
names <- c(names, "x","y","z","x","x","x","x","k") 


for(name in names) 
{ 
    if(name %in% colnames(mymatrix)) 
    { 
        mymatrix[2,name] = mymatrix[2,name] + 1 
    } 
    else 
    { 
        mymatrix <- transform(mymatrix,name) 
        mymatrix[2,name] = 1 
    } 
} 


The expected output:
> mymatrix 
   x  y  z  a  k 
N NA NA NA NA NA 
A  8  3  2  1  1

Source

2016-05-22 Imlerith

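I don't see 'names' in the code. Should that be 'rnames'? – Gopala

No, names is a different vector from rnames. I noted in a comment how I fill that vector, but if you want the source code I can provide it: mydataframe <- readRDS(file); names <- colnames(mydataframe) – Imlerith

Could you post a reproducible example with minimal input and the expected output? It is hard to help if we cannot run the code. – Gopala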

I don't know how you determined that match is the bottleneck. It might be, but the example you provided does not show that.

rnames <- c("N","A") 
mymatrix <- matrix(nrow=2, ncol=0, dimnames=list(rnames)) 
set.seed(21) 
names <- sample(letters, 1e6, TRUE) 
Rprof() 
for(name in names) { 
    if(name %in% colnames(mymatrix)) { 
    mymatrix[2,name] <- mymatrix[2,name] + 1 
    } else { 
    mymatrix <- transform(mymatrix,name) 
    mymatrix[2,name] <- 1 
    } 
} 
Rprof(NULL)

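The results below show that the bottleneck is the data.frame methods, which are called because you use transform. transform.default converts your matrix to a data.frame and then calls transform.data.frame, which includes a call to match.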

R> lapply(summaryRprof(), head) 
$by.self 
                 self.time self.pct total.time total.pct 
"[<-.data.frame"     12.02    26.15      25.90     56.35 
"[.data.frame"        7.22    15.71      13.32     28.98 
"match"               7.20    15.67      11.40     24.80 
"%in%"                2.38     5.18      12.34     26.85 
"anyDuplicated"       2.22     4.83       3.08      6.70 
"names"               2.16     4.70       2.16      4.70 

$by.total 
                 total.time total.pct self.time self.pct 
"[<-"                 27.06     58.88      1.16     2.52 
"[<-.data.frame"      25.90     56.35     12.02    26.15 
"["                   14.32     31.16      1.00     2.18 
"[.data.frame"        13.32     28.98      7.22    15.71 
"%in%"                12.34     26.85      2.38     5.18 
"match"               11.40     24.80      7.20    15.67 

$sample.interval 
[1] 0.02 

$sampling.time 
[1] 45.96
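
The conversion described above can be checked in isolation. The following is a small sketch, not taken from the original answer; `m` and `df` are illustrative names, and the example only shows that a matrix passed through transform comes back as a data.frame, so later assignments go through the data.frame methods.

```r
# Illustrative check (not from the original answer): what transform() does to a matrix
rnames <- c("N", "A")
m <- matrix(nrow = 2, ncol = 0, dimnames = list(rnames))
is.matrix(m)        # the starting object is a plain matrix

# transform.default() wraps its input in data.frame() before dispatching,
# so the result is a data.frame even though no column was added
df <- transform(m, "x")
is.data.frame(df)

# From here on, assignments are handled by `[<-.data.frame`,
# which is the function at the top of the profile
df[2, "x"] <- 1
df
```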

Avoid the call to transform and your code will be significantly faster. Note that mymatrix2 is actually a matrix, whereas mymatrix is a data.frame.

rnames <- c("N","A") 
mymatrix2 <- matrix(nrow=2, ncol=0, dimnames=list(rnames)) 
set.seed(21) 
names <- sample(letters, 1e6, TRUE) 
Rprof() 
for(name in names) { 
    if(name %in% colnames(mymatrix2)) { 
    mymatrix2[2,name] <- mymatrix2[2,name] + 1 
    } else { 
    mymatrix2 <- cbind(mymatrix2, matrix(c(NA,1), 2, 1, dimnames=list(rnames, name))) 
    } 
} 
Rprof(NULL) 
R> lapply(summaryRprof(), head) 
$by.self 
                self.time self.pct total.time total.pct 
"match"              1.28    41.83       2.70     88.24 
"colnames"           0.78    25.49       1.42     46.41 
"is.data.frame"      0.58    18.95       0.58     18.95 
"%in%"               0.34    11.11       3.04     99.35 
"dimnames"           0.06     1.96       0.06      1.96 
"+"                  0.02     0.65       0.02      0.65 

$by.total 
                total.time total.pct self.time self.pct 
"%in%"                3.04     99.35      0.34    11.11 
"match"               2.70     88.24      1.28    41.83 
"colnames"            1.42     46.41      0.78    25.49 
"is.data.frame"       0.58     18.95      0.58    18.95 
"dimnames"            0.06      1.96      0.06     1.96 
"+"                   0.02      0.65      0.02     0.65 

R> identical(mymatrix2, as.matrix(mymatrix)) 
[1] TRUE
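
If the per-element loop is still too slow on the full data set, one alternative worth trying is to count all names in a single pass with table() instead of growing the matrix column by column. This is a different technique from the cbind loop above, sketched here under assumptions: `file_names`, `all_names`, `lev` and `count_matrix` are hypothetical names, and the two vectors stand in for what would be read from each file.

```r
# Hypothetical stand-ins for the names read from each file
file_names <- list(
  c("x", "y", "z", "x", "x", "y", "a"),
  c("x", "y", "z", "x", "x", "x", "x", "k")
)

# Flatten everything into one character vector
all_names <- unlist(file_names, use.names = FALSE)

# Fix the column order to first appearance, then count in one vectorised call
lev <- unique(all_names)
counts <- table(factor(all_names, levels = lev))

# Rebuild the two-row layout from the question:
# row "N" stays NA, row "A" holds the counts
count_matrix <- rbind(N = NA, A = as.vector(counts))
colnames(count_matrix) <- lev
count_matrix
```

With real data, each list element could be produced by something like colnames(readRDS(file)), as described in the comments on the question. Note that the result is an integer matrix, so it may not be identical() to a matrix built by the loop even when the counts agree.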

Source

2016-05-22 17:00:29

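Thanks for the detailed input. I noticed that using cbind instead of transform to process the files in just one folder improved the run time from 7.46s to 0.92s. Next I will try it on the whole data set. – Imlerith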
