0

I want to run a simple test to see the performance difference between code that causes cache misses and code that does not. Looking at cache misses - a simple C++ cache benchmark

I want to see that operating on an array X that fits in the cache performs better than operating on an array Y that does not fit. In fact, I want to find the critical array size at which cache misses start to affect performance.

I made a simple function that accesses an array in a loop. I expected one level of performance for an arr_size that fits in the cache and a different one for an arr_size that does not. But the performance I get is equally low regardless of arr_size, even for large sizes such as 20 MB. Why is that?

// compiled without optimizations: -O0
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <ctime>

long get_nsec();   // forward declaration, defined below

float benchmark_cache(const size_t arr_size) 
{ 

    unsigned char* arr_a = (unsigned char*) malloc(sizeof(char) * arr_size); 
    unsigned char* arr_b = (unsigned char*) malloc(sizeof(char) * arr_size); 

    assert(arr_a); 
    assert(arr_b); 

    long time0 = get_nsec(); 

    for(size_t i = 0; i < arr_size; ++i) { 
     // index k will jump forth and back, to generate cache misses 
     size_t k = (i/2) + (i % 2) * arr_size/2; 
     arr_b[k] = arr_a[k] + 1; 
    } 

    long time_d = get_nsec() - time0; 
    float performance = float(time_d)/arr_size; 
    printf("perf %.1f [kB]: %d\n", 
     performance, 
     arr_size /1024); 

    free(arr_a); 
    free(arr_b); 

    return performance; 
} 

long get_nsec() 
{ 
    timespec ts; 
    clock_gettime(CLOCK_REALTIME, &ts); 
    return long(ts.tv_sec) * 1000 * 1000 * 1000 + ts.tv_nsec;  // seconds -> nanoseconds 
} 

Answers

0

You should use a tool like cachegrind in cache-simulation mode to get sensible results. Otherwise cache performance is significantly affected by the context switches caused by the scheduler.

+0

Actually I am trying to *avoid* cachegrind - I want to see real cache behaviour, not a simulation – 2012-02-23 12:17:26

+0

@JakubM.: Use cachegrind first, because it gives you numbers; once you have found something interesting, you can check whether it behaves as expected. – 2012-02-23 12:22:16

1

It is hard to say, but my guess is that the CPU's predictive, linear loading is helping you a lot. That is, since you access the data in order, the CPU loads the next chunk of data while you hit an uncached value. This loading can essentially happen in parallel, so you may not really be waiting for the load at all.

I know you are trying to jump around, but the read/write order is still essentially linear: you just walk through two blocks instead of one. Try using a cheap random number generator to jump around much more.
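A minimal sketch of that idea (my own illustration, not the poster's code): a linear congruential generator is cheap enough to add almost no overhead, and it defeats the prefetcher as long as arr_size is a power of two so the reduction is a simple mask.

// Sketch: access the arrays through a cheap LCG so the pattern is no longer
// predictable by the hardware prefetcher. Assumes arr_size is a power of two.
// The constants are the common Numerical Recipes values; this does not visit
// every element exactly once, which is usually acceptable for a cache benchmark.
void touch_randomly(unsigned char* arr_a, unsigned char* arr_b, size_t arr_size)
{
    size_t k = 1;
    for (size_t i = 0; i < arr_size; ++i) {
        k = (k * 1664525u + 1013904223u) & (arr_size - 1);
        arr_b[k] = arr_a[k] + 1;
    }
}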

Also note that % is a relatively slow operation, so you may inadvertently be measuring its cost. Compiling without optimization means the compiler will probably emit an actual divide/modulo here rather than a mask. Try the test with full optimization turned on.
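For instance, the forth-and-back index from the question can be written with a shift and a mask so that even an unoptimized build emits no divide (a sketch of a drop-in replacement for the loop in benchmark_cache, equivalent to the original expression):

// Sketch: same index as (i/2) + (i % 2) * arr_size/2, but with shift and mask,
// so no div instruction is generated even at -O0.
for (size_t i = 0; i < arr_size; ++i) {
    size_t k = (i >> 1) + (i & 1) * (arr_size >> 1);
    arr_b[k] = arr_a[k] + 1;
}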

Also, be sure to give your thread a fixed CPU affinity and real-time priority (how you do this depends on your OS). That should limit any context-switching overhead.
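On Linux this might look roughly like the sketch below. This is only an illustration of mine: it uses pthread_setaffinity_np and sched_setscheduler, needs glibc with _GNU_SOURCE and -pthread, requires sufficient privileges for SCHED_FIFO, and other operating systems have entirely different APIs.

#include <pthread.h>
#include <sched.h>

// Sketch: pin the calling thread to core 0 and give it real-time FIFO priority,
// so the scheduler neither migrates nor preempts the benchmark.
bool pin_and_prioritize()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                  // run only on core 0
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return false;

    sched_param sp = {};
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0;   // 0 = this process
}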

+0

Which "mod operator"? x86 defines a div instruction that produces the remainder as a by-product. – 2012-02-23 12:32:40

+0

I don't know which assembly is responsible, but I know from some profiling that '%' can noticeably slow a piece of code down (probably any division does). Well, by noticeably I of course only mean on the nanosecond scale. – 2012-02-24 09:01:44

1

Cache performance improvements appear when you access the same locations repeatedly without touching too many other locations in between. Here you access the memory you allocated only once, so you will not see much of a cache effect.

Even if you change the code to walk over the whole array several times, the cache-handling logic tries to predict your accesses, and if the pattern is simple enough it usually succeeds. Linear forward access (even split into two halves) is easy to predict.
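For illustration, a repeated-pass variant of the question's loop might look like the sketch below (the pass count is arbitrary). When arr_size fits in cache, every pass after the first is served from cache; when it does not, each pass must fetch from memory again. Since the pattern is still linear, prefetching hides much of the remaining difference, which is exactly the point above.

// Sketch: several passes over the same arrays instead of a single pass.
void touch_repeatedly(unsigned char* arr_a, unsigned char* arr_b, size_t arr_size)
{
    const int passes = 16;   // arbitrary; enough to amortize the cold first pass
    for (int p = 0; p < passes; ++p)
        for (size_t i = 0; i < arr_size; ++i)
            arr_b[i] = arr_a[i] + 1;
}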

+0

Typically the CPU accesses RAM in 64-byte chunks, so when you request the first byte, the CPU reads and caches the adjacent 64 bytes. So even with the OP's example the CPU cache probably matters. Also, the hardware RAM prefetcher can optimize memory access further. – 2016-08-31 22:00:58
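A rough way to see this granularity (my sketch; 64 bytes is typical for x86 but not guaranteed): once the buffer no longer fits in cache, touching one byte per line costs almost as much as touching every byte, because the whole line is fetched either way.

// Sketch: for large n both loops transfer the same number of cache lines, so
// their run times are much closer than the 64x iteration-count difference suggests.
void touch_every_byte(unsigned char* p, size_t n)
{
    for (size_t i = 0; i < n; ++i) p[i] += 1;
}

void touch_one_byte_per_line(unsigned char* p, size_t n)
{
    for (size_t i = 0; i < n; i += 64) p[i] += 1;   // 64 = assumed cache-line size
}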

0

I just read what should I know about memory and played with the benchmark sample. Hope this helps someone:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <ctime>
#include <iostream>
#include <random>
#include <vector>

struct TimeLogger 
{ 
    const char* m_blockName; 
    const clock_t m_start; 

    TimeLogger(const char* blockName) : m_blockName(blockName), m_start(clock()) {} 
    ~TimeLogger()      
    { 
     clock_t finish = clock(); 
     std::cout << "Done: " << m_blockName << " in " << (finish - m_start) * 1000.0/CLOCKS_PER_SEC << " ms" << std::endl; 
    } 
}; 

const size_t k_ITERATIONS = 16; 
const size_t k_SIZE = 1024 * 1024 * 16; 

uint64_t test(const char* name, const std::vector<int64_t>& data, const std::vector<size_t>& indexes) 
{ 
    TimeLogger log = name; 

    uint64_t sum = 0; 
    for (size_t i = 0; i < k_ITERATIONS; ++i) 
     for (size_t index : indexes) 
      sum += data[index]; 

    return sum; 
} 

// return shuffled sequences of consecutive numbers like [0,1,2, 6,7,8, 3,4,5, ...] 
std::vector<size_t> fillSequences(size_t size, size_t seriesSize, std::mt19937 g) 
{ 
    std::vector<size_t> semiRandIdx; 
    semiRandIdx.reserve(size); 

    size_t i = 0; 
    auto semiRandSequences = std::vector<size_t>(size/seriesSize, 0); 
    std::generate(semiRandSequences.begin(), semiRandSequences.end(), [&i]() { return i++; }); 
    std::shuffle(semiRandSequences.begin(), semiRandSequences.end(), g); 

    for (size_t seqNumber : semiRandSequences) 
     for (size_t i = seqNumber * seriesSize; i < (seqNumber + 1) * seriesSize; ++i) 
      semiRandIdx.push_back(i); 

    return semiRandIdx; 
} 

int main() 
{ 
    std::random_device rd; 
    std::mt19937 g(rd()); 

    auto intData = std::vector<int64_t>(k_SIZE, 0); 
    std::generate(intData.begin(), intData.end(), g); 

    // [0, 1, 2, ... N] 
    auto idx = std::vector<size_t>(k_SIZE, 0); 
    std::generate(idx.begin(), idx.end(), []() {static size_t i = 0; return i++; }); 

    // [N, N-1, ... 0] 
    auto reverseIdx = std::vector<size_t>(idx.rbegin(), idx.rend()); 

    // random permutation of [0, 1, ... N] 
    auto randIdx = idx; 
    std::shuffle(randIdx.begin(), randIdx.end(), g); 

    // random permutations of 32, 64, 128-byte sequences 
    auto seq32Idx = fillSequences(idx.size(), 32/sizeof(int64_t), g); 
    auto seq64Idx = fillSequences(idx.size(), 64/sizeof(int64_t), g); 
    auto seq128Idx = fillSequences(idx.size(), 128/sizeof(int64_t), g); 

    size_t dataSize = intData.size() * sizeof(int64_t); 
    size_t indexSize = idx.size() * sizeof(int64_t); 
    std::cout << "vectors filled, data (MB): " << dataSize/1024/1024.0 << "; index (MB): " << indexSize/1024/1024.0 
     << "; total (MB): " << (dataSize + indexSize)/1024/1024.0 << std::endl << "Loops: " << k_ITERATIONS << std::endl; 

    uint64_t sum1 = test("regular access", intData, idx); 
    uint64_t sum2 = test("reverse access", intData, reverseIdx); 
    uint64_t sum3 = test("random access", intData, randIdx); 
    uint64_t sum4 = test("random 32-byte sequences", intData, seq32Idx); 
    uint64_t sum5 = test("random 64-byte sequences", intData, seq64Idx); 
    uint64_t sum6 = test("random 128-byte sequences", intData, seq128Idx); 

    std::cout << sum1 << ", " << sum2 << ", " << sum3 << ", " << sum4 << ", " << sum5 << ", " << sum6 << std::endl; 
    return 0; 
} 

Curiously, the CPU prefetcher handles reverse array access extremely well: when comparing forward access time with reverse access, I found that on my PC the performance is the same.

Here are also some results from a laptop with 2x32 KB L1, 2x256 KB L2 and 3 MB L3 cache:

vectors filled, data (MB): 512; index (MB): 512; total (MB): 1024 
Loops: 1 
Done: regular access in 147 ms 
Done: reverse access in 119 ms 
Done: random access in 2943 ms 
Done: random 32-byte sequences in 938 ms 
Done: random 64-byte sequences in 618 ms 
Done: random 128-byte sequences in 495 ms 

... 

vectors filled, data (MB): 4; index (MB): 4; total (MB): 8 
Loops: 512 
Done: regular access in 331 ms 
Done: reverse access in 334 ms 
Done: random access in 1961 ms 
Done: random 32-byte sequences in 1099 ms 
Done: random 64-byte sequences in 930 ms 
Done: random 128-byte sequences in 824 ms 

... 

vectors filled, data (MB): 1; index (MB): 1; total (MB): 2 
Loops: 2048 
Done: regular access in 174 ms 
Done: reverse access in 162 ms 
Done: random access in 490 ms 
Done: random 32-byte sequences in 318 ms 
Done: random 64-byte sequences in 295 ms 
Done: random 128-byte sequences in 257 ms 

... 

vectors filled, data (MB): 0.125; index (MB): 0.125; total (MB): 0.25 
Loops: 16384 
Done: regular access in 148 ms 
Done: reverse access in 139 ms 
Done: random access in 210 ms 
Done: random 32-byte sequences in 179 ms 
Done: random 64-byte sequences in 166 ms 
Done: random 128-byte sequences in 163 ms