Array size and copy performance

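I'm sure this has been answered before, but I can't find a good explanation.

I'm writing a graphics program where one part of the pipeline copies voxel data to OpenCL page-locked (pinned) memory. I found that this copy is a bottleneck and took some measurements of the performance of a plain std::copy. The data is floats, and every chunk I want to copy is around 64 MB in size.

This is my original code, before any benchmarking attempts: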

std::copy(data, data+numVoxels, pinnedPointer_[_index]);

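Here data is a float pointer, numVoxels is an unsigned int, and pinnedPointer_[_index] is a float pointer referencing a pinned OpenCL buffer.

Since performance was slow, I decided to try copying smaller parts of the data instead and see what kind of bandwidth I got. I used boost::timer::cpu_timer for timing. I tried running it once as well as averaging over a few hundred runs, and got similar results. Here is the relevant code along with the results: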

boost::timer::cpu_timer t;
unsigned int testNum = numVoxels;
while (testNum > 2) {
    t.start();
    std::copy(data, data+testNum, pinnedPointer_[_index]);
    t.stop();
    boost::timer::cpu_times result = t.elapsed();
    double time = (double)result.wall/1.0e9;   // wall time is reported in nanoseconds
    int size = testNum*sizeof(float);
    double GB = (double)size/1073741824.0;     // 2^30 bytes per GB
    // Print results
    testNum /= 2;
}

Copied 67108864 bytes in 0.032683s, 1.912315 GB/s 
Copied 33554432 bytes in 0.017193s, 1.817568 GB/s 
Copied 16777216 bytes in 0.008586s, 1.819749 GB/s 
Copied 8388608 bytes in 0.004227s, 1.848218 GB/s 
Copied 4194304 bytes in 0.001886s, 2.071705 GB/s 
Copied 2097152 bytes in 0.000819s, 2.383543 GB/s 
Copied 1048576 bytes in 0.000290s, 3.366923 GB/s 
Copied 524288 bytes in 0.000063s, 7.776913 GB/s 
Copied 262144 bytes in 0.000016s, 15.741867 GB/s 
Copied 131072 bytes in 0.000008s, 15.213149 GB/s 
Copied 65536 bytes in 0.000004s, 14.374742 GB/s 
Copied 32768 bytes in 0.000003s, 10.209962 GB/s 
Copied 16384 bytes in 0.000001s, 10.344942 GB/s 
Copied 8192 bytes in 0.000001s, 6.476566 GB/s 
Copied 4096 bytes in 0.000001s, 4.999603 GB/s 
Copied 2048 bytes in 0.000001s, 1.592111 GB/s 
Copied 1024 bytes in 0.000001s, 1.600125 GB/s 
Copied 512 bytes in 0.000001s, 0.843960 GB/s 
Copied 256 bytes in 0.000001s, 0.210990 GB/s 
Copied 128 bytes in 0.000001s, 0.098439 GB/s 
Copied 64 bytes in 0.000001s, 0.049795 GB/s 
Copied 32 bytes in 0.000001s, 0.049837 GB/s 
Copied 16 bytes in 0.000001s, 0.023728 GB/s

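There is a clear bandwidth peak when copying chunks of 65536 to 262144 bytes, and the bandwidth there is far higher than when copying the full array (15 vs. 2 GB/s).

Knowing this, I decided to try another thing and copy the full array, but using repeated calls to std::copy where each call handles only part of the array. Trying different chunk sizes, these are my results: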

unsigned int testNum = numVoxels;
unsigned int parts = 1;
while (sizeof(float)*testNum > 256) {
    t.start();
    for (unsigned int i=0; i<parts; ++i) {
        std::copy(data+i*testNum,
                  data+(i+1)*testNum,
                  pinnedPointer_[_index]+i*testNum);
    }
    t.stop();
    boost::timer::cpu_times result = t.elapsed();
    double time = (double)result.wall/1.0e9;
    int size = testNum*sizeof(float);
    double GB = parts*(double)size/1073741824.0;
    // Print results
    parts *= 2;
    testNum /= 2;
}

Part size 67108864 bytes, copied 0.0625 GB in 0.0331298s, 1.88652 GB/s 
Part size 33554432 bytes, copied 0.0625 GB in 0.0339876s, 1.83891 GB/s 
Part size 16777216 bytes, copied 0.0625 GB in 0.0342558s, 1.82451 GB/s 
Part size 8388608 bytes, copied 0.0625 GB in 0.0334264s, 1.86978 GB/s 
Part size 4194304 bytes, copied 0.0625 GB in 0.0287896s, 2.17092 GB/s 
Part size 2097152 bytes, copied 0.0625 GB in 0.0289941s, 2.15561 GB/s 
Part size 1048576 bytes, copied 0.0625 GB in 0.0240215s, 2.60184 GB/s 
Part size 524288 bytes, copied 0.0625 GB in 0.0184499s, 3.38756 GB/s 
Part size 262144 bytes, copied 0.0625 GB in 0.0186002s, 3.36018 GB/s 
Part size 131072 bytes, copied 0.0625 GB in 0.0185958s, 3.36097 GB/s 
Part size 65536 bytes, copied 0.0625 GB in 0.0185735s, 3.365 GB/s 
Part size 32768 bytes, copied 0.0625 GB in 0.0186523s, 3.35079 GB/s 
Part size 16384 bytes, copied 0.0625 GB in 0.0187756s, 3.32879 GB/s 
Part size 8192 bytes, copied 0.0625 GB in 0.0182212s, 3.43007 GB/s 
Part size 4096 bytes, copied 0.0625 GB in 0.01825s, 3.42465 GB/s 
Part size 2048 bytes, copied 0.0625 GB in 0.0181881s, 3.43631 GB/s 
Part size 1024 bytes, copied 0.0625 GB in 0.0180842s, 3.45605 GB/s 
Part size 512 bytes, copied 0.0625 GB in 0.0186669s, 3.34817 GB/s

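It seems that decreasing the chunk size does have a significant impact, but I still can't get anywhere near 15 GB/s.

I'm running 64-bit Ubuntu; GCC optimization doesn't make much of a difference.

Why does the array size affect the bandwidth in this way?
Does the OpenCL pinned memory play a part?
What are the strategies for optimizing a large array copy?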

2013-05-20 Victor Sand

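You may be running into your operating system's page-fault system. It may be swapping memory in 64k chunks.

Could you pass the array by pointer or reference instead of copying it?

Make sure you run the tests repeatedly.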

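I'm pretty sure you are running into cache thrashing. If you fill the cache with data you have written, then the next time some data is needed, the cache has to read that data from memory, but first it needs to find some space in the cache. Because all of the data [or at least a lot of it] is "dirty", having been written to, it has to be written out to RAM. Next, we write some new data to the cache, which throws out another bit of dirty data (or something we read in earlier).

In assembler, we can overcome this by using a "non-temporal" move instruction, for example the SSE instruction movntps. This instruction will "avoid storing things in the cache".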
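
The same idea can be sketched in C++ instead of assembler. The SSE intrinsic _mm_stream_ps from <xmmintrin.h> normally compiles to movntps. The helper name copy_nontemporal is made up for illustration, and the sketch assumes an x86 target with SSE (it falls back to std::copy otherwise). Whether it actually beats std::copy on a given machine has to be measured.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define HAVE_SSE_STREAM 1
#endif

// Hypothetical helper (not from the original post): copy n floats from src
// to dst, writing dst with non-temporal stores where SSE is available.
inline void copy_nontemporal(const float* src, float* dst, std::size_t n) {
#ifdef HAVE_SSE_STREAM
    std::size_t i = 0;
    // _mm_stream_ps requires a 16-byte-aligned destination, so copy leading
    // floats one at a time until dst + i is aligned.
    while (i < n && (reinterpret_cast<std::uintptr_t>(dst + i) & 15u) != 0) {
        dst[i] = src[i];
        ++i;
    }
    // Main loop: unaligned load of 4 floats, non-temporal store of 4 floats.
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_stream_ps(dst + i, v);
    }
    // Copy any remaining floats (fewer than 4).
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
    // Make the streamed stores globally visible before returning.
    _mm_sfence();
#else
    // Assumption: no SSE on this target, so use an ordinary copy.
    std::copy(src, src + n, dst);
#endif
}
```

With the names from the question, a call would look like copy_nontemporal(data, pinnedPointer_[_index], numVoxels).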

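Edit: You may also get better performance by not mixing reads and writes. Use a small buffer [a fixed-size array] of 4-16KB, copy the data into that buffer, and then write the buffer out to the new place where you want it. Again, ideally use non-temporal writes, as that should improve throughput even in this case, but just using "blocks" to read and then write, rather than read one, write one, will be faster.

Something like this: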

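float temp[2048];
unsigned int left_to_do = numVoxels;
unsigned int offset = 0;

while (left_to_do)
{
    // Copy at most one temp buffer's worth of floats per iteration.
    unsigned int block = std::min<unsigned int>(left_to_do, sizeof(temp)/sizeof(temp[0]));
    std::copy(data+offset, data+offset+block, temp);              // read a block into the small buffer
    std::copy(temp, temp+block, pinnedPointer_[_index]+offset);   // write it out to the pinned buffer
    offset += block;
    left_to_do -= block;
}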

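Try that, and see if it improves things. It may not...

Edit2: I should explain that this is faster because you are re-using the same bit of cache to load data into every time, and by not mixing the reading and writing, we get better performance from the memory itself.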
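
To check whether the staged copy helps on a particular machine, a small timing harness along these lines could be used. This is only a sketch: the names to_gib, best_bandwidth and staged_copy are made up here, std::chrono stands in for the boost timer used in the question, and taking the best of several runs is one possible choice (averaging is another).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Convert a byte count to GB as used in the question's output (2^30 bytes).
inline double to_gib(std::size_t bytes) {
    return static_cast<double>(bytes) / 1073741824.0;
}

// Hypothetical helper: run copy_fn `repeats` times and return the best
// observed bandwidth in GB/s for a copy of `bytes` bytes.
template <typename CopyFn>
double best_bandwidth(CopyFn copy_fn, std::size_t bytes, int repeats) {
    double best_seconds = -1.0;
    for (int r = 0; r < repeats; ++r) {
        auto start = std::chrono::steady_clock::now();
        copy_fn();
        auto stop = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(stop - start).count();
        if (best_seconds < 0.0 || s < best_seconds) best_seconds = s;
    }
    // Guard against a zero reading from a coarse clock (or repeats <= 0).
    return best_seconds > 0.0 ? to_gib(bytes) / best_seconds : 0.0;
}

// The staged copy described above: read a block into a small buffer,
// then write that buffer out to the destination.
inline void staged_copy(const float* src, float* dst, std::size_t n) {
    float temp[2048];
    const std::size_t cap = sizeof(temp) / sizeof(temp[0]);
    for (std::size_t offset = 0; offset < n; ) {
        std::size_t block = std::min(n - offset, cap);
        std::copy(src + offset, src + offset + block, temp);
        std::copy(temp, temp + block, dst + offset);
        offset += block;
    }
}
```

With the names from the question, best_bandwidth([&] { staged_copy(data, pinnedPointer_[_index], numVoxels); }, numVoxels * sizeof(float), 100) could then be compared against the same call wrapping a plain std::copy.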

2013-05-20 21:15:59

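Thanks, this is very helpful. It's a bit over my head, but I'll definitely think about it and try your suggestions.

Another follow-up: the memory I copy into is page-locked by OpenCL. What role does that play in this?

@VictorSand No, it shouldn't. Page-locked just means that it can't be paged out or moved. So unless you are short on memory, that should be no problem. Note that 15GB/s is roughly the peak performance of the processor writing to the cache. You can't sustain that for long.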
