Inconsistent performance when multithreading with Armadillo and OpenBLAS

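Using Armadillo, I wrote a matrix-vector multiplication and a linear system solve. Armadillo is compiled from source and uses OpenBLAS, which is also compiled from source. Unfortunately, I get inconsistent results between single-threaded and multi-threaded runs: the matrix-vector multiplication runs faster on a single thread, while the linear system solve runs faster with multiple threads. I would appreciate any pointers as to what I am doing wrong.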

See below:

Source code
Compile & run bash script
Results
System information

matmul_armadillo.cpp

#include <armadillo> 

using namespace arma; 

int main(int argc, char *argv[]) 
{ 
    const int n = atoi(argv[1]); 

    mat A = randu<mat>(n, n); 
    vec x = randu<vec>(n); 

    A*x; 

    return 0; 
}

solve_armadillo.cpp

#include <armadillo> 

using namespace arma; 

int main(int argc, char *argv[]) 
{ 
    const int n = atoi(argv[1]); 

    mat A = randu<mat>(n, n); 
    vec b = randu<vec>(n); 
    vec x; 

    x = solve(A, b); 

    return 0; 
}

benchmark.sh

#!/bin/bash 

g++ matmul_armadillo.cpp -o matmul_armadillo -O3 -march=native -std=c++11 -larmadillo 
g++ solve_armadillo.cpp -o solve_armadillo -O3 -march=native -std=c++11 -larmadillo 

N=7500 

export OPENBLAS_NUM_THREADS=1 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N 
echo '' 

export OPENBLAS_NUM_THREADS=2 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N 
echo '' 

export OPENBLAS_NUM_THREADS=3 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N 
echo '' 

export OPENBLAS_NUM_THREADS=4 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N 
echo '' 

export OPENBLAS_NUM_THREADS=5 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N 
echo '' 

export OPENBLAS_NUM_THREADS=6 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N 
echo '' 

export OPENBLAS_NUM_THREADS=7 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N 
echo '' 

export OPENBLAS_NUM_THREADS=8 
echo 'Running matmul_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./matmul_armadillo $N 
echo '' 
echo 'Running solve_armadillo on' $OPENBLAS_NUM_THREADS 'threads' 
time ./solve_armadillo $N

Results

$ ./benchmark.sh 
Running matmul_armadillo on 1 threads 

real 0m0.943s 
user 0m0.628s 
sys  0m0.159s 

Running solve_armadillo on 1 threads 

real 0m13.910s 
user 0m13.553s 
sys  0m0.300s 

Running matmul_armadillo on 2 threads 

real 0m1.528s 
user 0m1.361s 
sys  0m0.402s 

Running solve_armadillo on 2 threads 

real 0m15.815s 
user 0m29.097s 
sys  0m1.083s 

Running matmul_armadillo on 3 threads 

real 0m1.534s 
user 0m1.480s 
sys  0m0.533s 

Running solve_armadillo on 3 threads 

real 0m11.729s 
user 0m31.022s 
sys  0m1.290s 

Running matmul_armadillo on 4 threads 

real 0m1.543s 
user 0m1.619s 
sys  0m0.674s 

Running solve_armadillo on 4 threads 

real 0m10.013s 
user 0m34.055s 
sys  0m1.696s 

Running matmul_armadillo on 5 threads 

real 0m1.545s 
user 0m1.620s 
sys  0m0.664s 

Running solve_armadillo on 5 threads 

real 0m9.945s 
user 0m33.803s 
sys  0m1.669s 

Running matmul_armadillo on 6 threads 

real 0m1.543s 
user 0m1.607s 
sys  0m0.684s 

Running solve_armadillo on 6 threads 

real 0m10.069s 
user 0m34.283s 
sys  0m1.699s 

Running matmul_armadillo on 7 threads 

real 0m1.542s 
user 0m1.622s 
sys  0m0.661s 

Running solve_armadillo on 7 threads 

real 0m10.041s 
user 0m34.154s 
sys  0m1.704s 

Running matmul_armadillo on 8 threads 

real 0m1.546s 
user 0m1.576s 
sys  0m0.712s 

Running solve_armadillo on 8 threads 

real 0m10.123s 
user 0m34.492s 
sys  0m1.697s

System information

openSUSE 13.1 64-bit
Armadillo 4.100.2 (compiled from source)
OpenBLAS 0.2.8 (compiled from source)

Source

2014-03-27 Aeronaelius

You may want to visit the [OpenBLAS wiki](https://github.com/xianyi/OpenBLAS/issues), since you are more likely to get a reply there – mtall

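I suspect that

A*x;

may be optimized away, because you do not do anything with the result. The lazy-evaluation template magic behind Armadillo's multiplication operator can easily mean that the underlying BLAS routine is never called. So with threading enabled, all you measure is the overhead of setting the threads up, which is why your program runs faster with threading disabled.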

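With

x = solve(A, b);

it is different, because this leads fairly directly to the corresponding LAPACK call. That call probably cannot be optimized away: the compiler cannot rule out side effects, and you actually assign the result to a variable. For matrices this large, solve benefits from multiprocessing.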

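To fix your benchmark you should do two things:

1. Use the result of the computation, to stop the optimizer from doing too much.
2. Repeat the computation several times, to get better statistics and to reduce the impact of the initial setup cost.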

Here is an untested example:

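#include <cstdlib>   // atoi
#include <iostream>
#include <armadillo>

using namespace arma;

int main(int argc, char *argv[])
{
    const int n = atoi(argv[1]);

    mat A = randu<mat>(n, n);
    vec x = randu<vec>(n);

    // Repeat the product and feed each result back in,
    // so that no iteration can be discarded.
    for (int i = 0; i < 100; ++i) {
        x = A*x;
    }
    // Printing x uses the final result.
    x.print(std::cout);

    return 0;
}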

The call to print may not be necessary.

Source

2014-03-31 16:50:01 rerx
