2013-12-16

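I need to perform a matrix-vector multiplication where the matrix is complex, symmetric, and has four nonzero off-diagonal bands. So far I have been using the sparse BLAS routine mkl_zdiasymv for the multiplication, and it works fine on one core. I would like to find out whether multithreading (e.g. OpenMP) gives a performance boost. As far as I know, some (many?) of the MKL routines are threaded. However, even if I call mkl_set_num_threads(4), my program still runs on a single thread. How can I perform a threaded sparse matrix-vector multiplication with MKL?

To give a concrete example, here is a small test program, which I compile (with icc 14.01) using: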

icc mkl_test_mp.cpp -mkl -std=c++0x -openmp 

mkl_test_mp.cpp:

#include <complex> 
#include <vector> 
#include <iostream> 
#include <chrono> 
#include <cmath>   // for sqrt, used below to compute the far diagonal offset 

typedef std::complex<double> complex; 
using std::vector; 
using namespace std::chrono; 

#define MKL_Complex16 std::complex<double> 
#include "mkl.h" 

int vector_dimension = 10000000; 
int number_of_multiplications = 100; 

vector<complex> initialize_matrix() { 

    complex value_main_diagonal   = complex(1, 2); 
    complex value_sub_and_super_diagonal = complex(3, 4); 
    complex value_far_off_diagonal  = complex(5, 6); 

    std::vector<complex> matrix; 
    matrix.resize(1 * vector_dimension, value_main_diagonal); 
    matrix.resize(2 * vector_dimension, value_sub_and_super_diagonal); 
    matrix.resize(3 * vector_dimension, value_far_off_diagonal); 

    return matrix; 
} 

vector<complex> perform_matrix_vector_calculation(vector<complex>& matrix, const vector<complex>& x) { 

    mkl_set_num_threads(4); 

    vector<complex> result(vector_dimension); 

    char uplo = 'L'; // since the matrix is symmetric we only need to declare one triangular part of the matrix (here the lower one) 
    int number_of_nonzero_diagonals = 3; 
    vector<int> matrix_diagonal_offsets = {0, -1, -int(sqrt(vector_dimension))}; 

    complex *x_data = const_cast<complex* >(x.data()); // I do not like this, but mkl expects non const pointer (??) 

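    mkl_zdiasymv(
        &uplo,                          // which triangle is stored
        &vector_dimension,              // matrix order
        matrix.data(),                  // diagonals, stored one after another
        &vector_dimension,              // leading dimension of the diagonal array
        matrix_diagonal_offsets.data(), // offset of each stored diagonal
        &number_of_nonzero_diagonals,   // number of stored diagonals
        x_data,                         // input vector
        result.data()                   // output vector
    );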
    return result; 
} 

void print(vector<complex>& x) { 
    for(complex z : x) 
        std::cerr << z; 
    std::cerr << std::endl; 
} 

void run() { 
    vector<complex> matrix = initialize_matrix(); 
    vector<complex> current_vector(vector_dimension, 1); 

    for(int i = 0; i < number_of_multiplications; ++i) { 
     current_vector = perform_matrix_vector_calculation(matrix, current_vector); 
    } 
    std::cerr << current_vector[0] << std::endl; 
} 

int main() { 

    auto start = steady_clock::now(); 

    run(); 

    auto end = steady_clock::now(); 
    std::cerr << "runtime = " << duration<double, std::milli> (end - start).count() << " ms" << std::endl; 
    std::cerr << "runtime per multiplication = " << duration<double, std::milli> (end -  start).count()/number_of_multiplications << " ms" << std::endl; 
} 

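Is it even possible to parallelize this approach? What am I doing wrong? Are there other suggestions for speeding up the multiplication?

Answers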

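Since you do not show how you compile your code, could you check that you are linking against the multithreaded Intel MKL libraries and, for example, pthreads?

For example (this is for an older version of MKL):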

THREADING_LIB="$(MKL_PATH)/libmkl_$(IFACE_THREADING_PART)_thread.$(EXT)" 
OMP_LIB = -L"$(CMPLR_PATH)" -liomp5 

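There should be an examples directory in the MKL distribution, e.g. intel/composer_xe_2011_sp1.10.319/mkl/examples. There you can check the contents of spblasc/makefile to see how to link the multithreaded libraries correctly for your particular version of MKL.

Another suggestion that should speed things up is to add compiler optimization flags, e.g.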
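A quick run-time check can complement inspecting the link line. The following is only a sketch and not part of MKL: the helper names (openmp_max_threads, env_or_unset, report_threading_environment) are made up for illustration. It reports whether the file was compiled with OpenMP enabled and whether the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables, which can cap the number of threads, are set. With MKL linked in, mkl_get_max_threads() would be the more direct query.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: number of threads an OpenMP parallel region would
// use, or 1 if this file was compiled without OpenMP support.
int openmp_max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical helper: value of an environment variable, or "(unset)".
std::string env_or_unset(const char* name) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// Hypothetical helper: print the threading-related settings to stderr.
void report_threading_environment() {
#ifdef _OPENMP
    std::cerr << "compiled with OpenMP: yes" << std::endl;
#else
    std::cerr << "compiled with OpenMP: no" << std::endl;
#endif
    std::cerr << "OpenMP max threads:   " << openmp_max_threads() << std::endl;
    std::cerr << "OMP_NUM_THREADS:      " << env_or_unset("OMP_NUM_THREADS") << std::endl;
    std::cerr << "MKL_NUM_THREADS:      " << env_or_unset("MKL_NUM_THREADS") << std::endl;
}
```

Calling report_threading_environment() at the start of main() in the test program would show whether the OpenMP flag took effect and whether the environment limits the thread count to one.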

OPT_FLAGS = -xHost -O3

which lets icc generate code optimized for your architecture, so your command line would end up as:

icc mkl_test_mp.cpp -mkl -std=c++0x -openmp -xHost -O3
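On the question of other ways to speed up the multiplication: if the MKL routine still runs on one thread after the linking has been checked, one fallback is to write the banded product by hand and parallelize over rows. The sketch below is an illustration, not MKL's implementation. The function name sym_dia_matvec is made up, and it assumes the layout used in the question: zero-based indexing, only the lower triangle stored, each diagonal occupying n consecutive entries, and (as I understand the MKL diagonal format) sub-diagonals padded at the top so that entry i of a diagonal belongs to row i.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical helper (not an MKL routine): y = A * x for a symmetric matrix
// stored by diagonals. offsets[d] <= 0 is the offset of stored diagonal d;
// val[d * n + i] holds A(i, i + offsets[d]); padding entries are never read.
std::vector<cplx> sym_dia_matvec(const std::vector<cplx>& val,
                                 const std::vector<int>& offsets,
                                 const std::vector<cplx>& x) {
    const long n = static_cast<long>(x.size());
    assert(val.size() == offsets.size() * x.size());
    std::vector<cplx> y(x.size());

    // Each iteration writes only y[i], so rows can be processed in parallel.
    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        cplx sum(0.0, 0.0);
        for (std::size_t d = 0; d < offsets.size(); ++d) {
            const long k = -static_cast<long>(offsets[d]);  // distance below the main diagonal
            assert(k >= 0);
            const cplx* diag = val.data() + static_cast<long>(d) * n;
            if (k == 0) {
                sum += diag[i] * x[i];                       // A(i, i)
                continue;
            }
            if (i >= k)    sum += diag[i] * x[i - k];        // A(i, i-k), stored
            if (i + k < n) sum += diag[i + k] * x[i + k];    // A(i, i+k) = A(i+k, i)
        }
        y[i] = sum;
    }
    return y;
}
```

With the data from the test program this would be called as sym_dia_matvec(matrix, matrix_diagonal_offsets, x) and compiled with the same -openmp flag. Whether it beats the MKL routine depends on memory bandwidth and would have to be measured.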