Armadillo C++: how to assign to a submatrix efficiently

I want to assign the values of one matrix to a submatrix of another, as in A.submat(ni1, nk1, ni2, nk2) = B; but it seems very slow. Why is it so slow, and is there any way to improve it?

Here is my test code (the function "XForwardDifference" is called millions of times in my project, so I need to profile it carefully):

#include <armadillo> 
#include <chrono> 
#include <iostream> 
using ms = std::chrono::milliseconds; 
using ns = std::chrono::nanoseconds; 
using get_time = std::chrono::steady_clock; 

namespace { 
    const arma::ivec::fixed<5> iforward = {-1, 0, 1, 2, 3}; 
    const double MCA_1 = -0.30874; 
    const double MCA0 = -0.6326; 
    const double MCA1 = 1.2330; 
    const double MCA2 = -0.3334; 
    const double MCA3 = 0.04168; 
} 

arma::mat XForwardDifference(arma::mat& mat, 
          const int& ni1, 
          const int& ni2, 
          const int& nk1, 
          const int& nk2, 
          const double& dx) 
{ 
    arma::mat ret(size(mat)); 
    double sign_dx = 1./dx; 
    double mca_1= sign_dx * MCA_1; 
    double mca0 = sign_dx * MCA0; 
    double mca1 = sign_dx * MCA1; 
    double mca2 = sign_dx * MCA2; 
    double mca3 = sign_dx * MCA3; 
    auto t1 = get_time::now(); 
    auto m1 = mat.submat(ni1+iforward(0), nk1, ni2+iforward(0), nk2); 
    auto t2 = get_time::now(); 
    auto m2 = mca_1 * m1; 
    auto t3 = get_time::now(); 
    auto m3 = m2 + m2; 
    auto t4 = get_time::now(); 
    mat.submat(ni1, nk1, ni2, nk2) = m3; 
    auto t5 = get_time::now(); 
    std::cout << std::chrono::duration_cast<ns>(t2-t1).count() << std::endl; 
    std::cout << std::chrono::duration_cast<ns>(t3-t2).count() << std::endl; 
    std::cout << std::chrono::duration_cast<ns>(t4-t3).count() << std::endl; 
    std::cout << std::chrono::duration_cast<ns>(t5-t4).count() << std::endl; 


    // ret.submat(ni1, nk1, ni2, nk2) = 
    // mca_1* mat.submat(ni1+iforward(0), nk1, ni2+iforward(0), nk2) + 
    // mca0 * mat.submat(ni1+iforward(1), nk1, ni2+iforward(1), nk2) + 
    // mca1 * mat.submat(ni1+iforward(2), nk1, ni2+iforward(2), nk2) + 
    // mca2 * mat.submat(ni1+iforward(3), nk1, ni2+iforward(3), nk2) + 
    // mca3 * mat.submat(ni1+iforward(4), nk1, ni2+iforward(4), nk2); 
    return ret; 
} 


int main(int argc, char *argv[]) 
{ 
    const int len = 3; 
    int ni1, ni2, nk1, nk2, ni, nk; 
    ni = 200; 
    nk = 200; 
    ni1 = len; 
    ni2 = ni1 + ni - 1; 
    nk1 = len; 
    nk2 = nk1 + nk - 1; 
    const double dx = 1.; 
    auto start_time = get_time::now(); 
    arma::mat mat(ni + 2*len, nk + 2*len); 
    mat = XForwardDifference(mat, ni1, ni2, nk1, nk2, dx); 
    auto end_time = get_time::now(); 
    auto diff = end_time - start_time; 
    std::cout << "Elapsed time is : " 
      << std::chrono::duration_cast<ns>(diff).count() 
      << " ns " 
      << std::endl; 
    return 0; 
}

The output is:

180 
116 
110 
851123 
Elapsed time is : 961975 ns

You can see that mat.submat(ni1, nk1, ni2, nk2) = m3; accounts for most of the elapsed time.
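A single timed call like the one above also includes one-off costs such as cold caches and first-touch page faults, so the numbers can be noisy. A minimal sketch of a repeated-timing helper follows; `mean_ns_per_call` and its `warmup` parameter are hypothetical names introduced here, not part of the question's code or of Armadillo.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls `f` a few times untimed (warm-up), then times
// `reps` further calls and returns the mean wall-clock time per call in ns.
template <typename F>
double mean_ns_per_call(F&& f, std::size_t reps, std::size_t warmup = 3) {
    assert(reps > 0);
    using clock = std::chrono::steady_clock;

    for (std::size_t i = 0; i < warmup; ++i) f();  // untimed warm-up calls

    const auto t0 = clock::now();
    for (std::size_t i = 0; i < reps; ++i) f();
    const auto t1 = clock::now();

    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    return static_cast<double>(total) / static_cast<double>(reps);
}
```

It could be used as `mean_ns_per_call([&] { XForwardDifference(mat, ni1, ni2, nk1, nk2, dx); }, 1000)`, assuming the surrounding variables from the test program are in scope.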

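The reason given by hbrerkere:

Armadillo queues all operations until the result is assigned to a matrix or submatrix. That is why the assignment to the submat appears to take longer: the multiplication and addition are actually performed at assignment time, not before. Also, don't use the auto keyword with Armadillo matrices and expressions, as this can cause problems. - hbrerkere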

auto t1 = get_time::now();
arma::mat m1 = mat.submat(ni1+iforward(0), nk1, ni2+iforward(0), nk2);
auto t2 = get_time::now();
arma::mat m2 = mca_1 * m1;
auto t3 = get_time::now();
arma::mat m3 = m2 + m2;
auto t4 = get_time::now();
ret = m3;
auto t5 = get_time::now();

If I modify the code as he suggested, the output becomes:

391880 
356480 
373072 
113051 
Elapsed time is : 1352013 ns
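The shift in timings is what delayed evaluation would predict: once each step is stored in an arma::mat, the arithmetic is done at that step instead of at the final assignment. The toy sketch below illustrates the general idea in plain C++. It is not Armadillo's implementation, and `Vec`, `Scaled` and `g_ops` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Counts element-wise multiplications so we can see when work happens.
static std::size_t g_ops = 0;

// Lazy node for `scale * v`: stores a reference and a factor, computes nothing
// until an element is requested.
struct Scaled {
    const std::vector<double>& src;
    double scale;
    double operator[](std::size_t i) const { ++g_ops; return scale * src[i]; }
    std::size_t size() const { return src.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    // Assigning an expression is where the loop actually runs.
    Vec& operator=(const Scaled& e) {
        data.resize(e.size());
        for (std::size_t i = 0; i < e.size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Building the expression is cheap: it only records the operands.
inline Scaled operator*(double s, const Vec& v) { return Scaled{v.data, s}; }

// Returns the operation count observed before and after the assignment.
std::pair<std::size_t, std::size_t> ops_before_and_after_assignment() {
    g_ops = 0;
    Vec a(4, 1.5), b(4);
    Scaled expr = 2.0 * a;           // no arithmetic yet
    const std::size_t before = g_ops;
    b = expr;                        // four multiplications happen here
    assert(b.data[0] == 3.0);
    return {before, g_ops};
}
```

In this sketch `auto expr = 2.0 * a;` would deduce `Scaled`, a proxy that refers to `a`, rather than `Vec`. That is the kind of surprise hbrerkere's warning about auto points at.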

I have also run into a case where auto causes problems with Armadillo:

auto m1 = mat.submat(ni1, nk1, ni2, nk2) * 2; 
cout << size(m1) << endl;

It prints a very large size, which is incorrect.

Source

2017-08-30 Aristotle0

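Armadillo queues all operations until the result is assigned to a matrix or submatrix. That is why the assignment to the submat appears to take longer: the multiplication and addition are actually performed at assignment time, not before. Also, don't use the _auto_ keyword with Armadillo matrices and expressions, as this can cause problems. – hbrerkere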

@hbrerkere Your answer helped me a lot. If I use 'arma::mat' instead of 'auto', the time spent in each step changes just as you said. – Aristotle0

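There doesn't seem to be one specific point in your code that should make it slow. What you are doing looks fine. The only thing I would suggest is to minimize the number of submat calls, since each one may create a temporary matrix, which is considered relatively expensive. Consider doing the math without submat: just compute the indices and assign the elements yourself.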
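As a rough illustration of "compute the indices yourself", here is a sketch of the five-point stencil from the question written as plain loops over column-major storage, which is the layout arma::mat uses. It works on std::vector<double> so that it stays self-contained. With Armadillo the raw pointers could presumably come from memptr() or colptr() instead. The function name and parameter list are hypothetical, and whether this beats the submat expression has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Coefficients copied from the question's MCA_1 .. MCA3 constants.
static const double kCoef[5] = {-0.30874, -0.6326, 1.2330, -0.3334, 0.04168};

// Hypothetical helper: for rows ni1..ni2 and columns nk1..nk2 (inclusive),
//   out(i,k) = (1/dx) * sum_{s=-1..3} kCoef[s+1] * in(i+s, k)
// Both buffers are column-major: element (i,k) lives at index i + k*n_rows.
void x_forward_difference(const std::vector<double>& in, std::vector<double>& out,
                          std::size_t n_rows, std::size_t n_cols,
                          std::size_t ni1, std::size_t ni2,
                          std::size_t nk1, std::size_t nk2, double dx) {
    assert(in.size() == n_rows * n_cols && out.size() == in.size());
    assert(ni1 >= 1 && ni1 <= ni2 && ni2 + 3 < n_rows);  // stencil stays in bounds
    assert(nk1 <= nk2 && nk2 < n_cols);

    const double inv_dx = 1.0 / dx;
    double c[5];
    for (int s = 0; s < 5; ++s) c[s] = inv_dx * kCoef[s];

    for (std::size_t k = nk1; k <= nk2; ++k) {
        const double* src = in.data() + k * n_rows;  // start of column k
        double* dst = out.data() + k * n_rows;
        // Rows of one column are contiguous, so the inner loop is unit-stride.
        for (std::size_t i = ni1; i <= ni2; ++i) {
            dst[i] = c[0] * src[i - 1] + c[1] * src[i] + c[2] * src[i + 1] +
                     c[3] * src[i + 2] + c[4] * src[i + 3];
        }
    }
}
```

Each output element is written once and no temporary matrices are created, which is the saving the suggestion above is aiming for.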

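Given that, I have two other suggestions:

1. If you still believe something in your code could be improved, look into alternative ways of writing it (if possible) and use a function profiler, such as Valgrind, to see which functions cost the most and whether they can be optimized.
2. If you give up on that, consider learning how to use multithreading in C++. Nowadays this is quite easy. Either use OpenMP, which is very easy to apply, though you have to do it properly because it is easy to make mistakes (e.g. high contention); or use std::thread, a relatively new C++ construct that simply runs a function in a thread. Since your application is repetitive, you can use either of them. You can also use my simple thread pool implementation, which schedules a call to a new instance as soon as one finishes.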
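To make the OpenMP suggestion concrete, here is a minimal sketch, assuming the repeated calls are independent of each other (the thread does not confirm this). `run_independent_jobs` and `scaled_copies` are hypothetical names. If the compiler is not given its OpenMP flag (for example -fopenmp with GCC), the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs job(0) .. job(n_jobs-1), possibly in parallel.
// `job` must be safe to call concurrently with different indices.
template <typename F>
void run_independent_jobs(long n_jobs, F&& job) {
    assert(n_jobs >= 0);
    #pragma omp parallel for
    for (long j = 0; j < n_jobs; ++j) {
        job(j);
    }
}

// Example use: every job writes only its own slot of `out`, so threads never
// contend for the same element and no locking is needed.
std::vector<double> scaled_copies(const std::vector<double>& in, double factor) {
    std::vector<double> out(in.size());
    run_independent_jobs(static_cast<long>(in.size()), [&](long j) {
        const std::size_t i = static_cast<std::size_t>(j);
        out[i] = factor * in[i];
    });
    return out;
}
```

Keeping each job's writes disjoint is one simple way to avoid the contention problem mentioned above. For small matrices the cost of starting threads may outweigh the gain, so it is worth measuring before committing to this approach.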

Good luck.

Source

2017-08-30 12:33:04
