
Calculating memory bandwidth: weird results from nvprof output

How do I calculate GPU memory bandwidth given the following:

  1. The size of the data sample (in GB).
  2. The kernel execution time (from the nvprof output).

GPU: GTX 1050 Ti
CUDA: 8.0
OS: Windows 10
IDE: Visual Studio 2015

Normally I would use this formula: bandwidth [GB/s] = data_size [GB] / average_time [s]

However, when I apply the formula above to the get_mem_kernel() kernel, I get a wrong result: 441.93 [GB/s]
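
To be explicit, this is how I arrive at that figure, using the buffer size and the average get_mem_kernel time reported by nvprof (full output below):

data_size    = 100000000 B = 0.1 GB
average_time = 226.28 us = 226.28e-6 s
bandwidth    = 0.1 GB / 226.28e-6 s ≈ 441.93 GB/s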

I think this result is wrong, because the technical specifications of the GTX 1050 Ti state a global memory bandwidth of 112 [GB/s].

Where am I making a mistake, or is there something I am not understanding?

Sample code:

// cpp libs: 
#include <iostream> 
#include <sstream> 
#include <fstream> 
#include <iomanip> 
#include <stdexcept> 

// cuda libs: 
#include <cuda_runtime.h> 
#include <device_launch_parameters.h> 

#define ERROR_CHECK(CHECK_) if (CHECK_ != cudaError_t::cudaSuccess) { std::cout << "cuda error" << std::endl; throw std::runtime_error("cuda error"); } 

using data_type = double; 

template <typename T> constexpr __forceinline__ 
T div_s(T dividend, T divisor) 
{ 
    using P = double; 
    return static_cast <T> (static_cast <P> (dividend + divisor - 1)/static_cast <P> (divisor)); 
} 

__global__ 
void set_mem_kernel(const unsigned int size, data_type * const in_data) 
{ 
    int idx = blockIdx.x * blockDim.x + threadIdx.x; 
    if (idx < size) 
    { 
     in_data[idx] = static_cast <data_type> (idx); 
    } 
} 

__global__ 
void get_mem_kernel(const unsigned int size, data_type * const in_data) 
{ 
    int idx = blockIdx.x * blockDim.x + threadIdx.x; 
    data_type val = 0; 
    if (idx < size) 
    { 
     val = in_data[idx]; 
    } 
} 

struct quit_program 
{ 
public: 
    ~quit_program() 
    { 
     try 
     { 
      ERROR_CHECK(cudaDeviceReset()); 
     } 
     catch (...) {} 
    } 
} quit; 

int main() 
{ 
    unsigned int size = 12500000; // 100 mb; 
    size_t  byte = size * sizeof(data_type); 

    dim3 threads (256, 1, 1); 
    dim3 blocks (div_s(size, threads.x), 1, 1); 

    std::cout << size << std::endl; 
    std::cout << byte << std::endl; 
    std::cout << std::endl; 

    std::cout << threads.x << std::endl; 
    std::cout << blocks.x << std::endl; 
    std::cout << std::endl; 

    // data: 
    data_type * d_data = nullptr; 
    ERROR_CHECK(cudaMalloc(&d_data, byte)); 

    for (int i = 0; i < 20000; i++) 
    { 
     set_mem_kernel <<<blocks, threads>>> (size, d_data); 
     ERROR_CHECK(cudaDeviceSynchronize()); 
     ERROR_CHECK(cudaGetLastError()); 

     get_mem_kernel <<<blocks, threads>>> (size, d_data); 
     ERROR_CHECK(cudaDeviceSynchronize()); 
     ERROR_CHECK(cudaGetLastError()); 
    } 

    // Exit: 
    ERROR_CHECK(cudaFree(d_data)); 
    ERROR_CHECK(cudaDeviceReset()); 
    return EXIT_SUCCESS; 
} 

nvprof results:

D:\Dev\visual_studio\nevada_test_site\x64\Release>nvprof ./cuda_test.exe 
12500000 
100000000 

256 
48829 

==10508== NVPROF is profiling process 10508, command: ./cuda_test.exe 
==10508== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory 
==10508== Profiling application: ./cuda_test.exe 
==10508== Profiling result: 
Time(%)  Time  Calls  Avg  Min  Max Name 
81.12% 19.4508s  20000 972.54us 971.22us 978.32us set_mem_kernel(unsigned int, double*) 
18.88% 4.52568s  20000 226.28us 224.45us 271.14us get_mem_kernel(unsigned int, double*) 

==10508== API calls: 
Time(%)  Time  Calls  Avg  Min  Max Name 
97.53% 26.8907s  40000 672.27us 247.98us 1.7566ms cudaDeviceSynchronize 
    1.61% 443.32ms  40000 11.082us 5.8340us 183.43us cudaLaunch 
    0.51% 141.10ms   1 141.10ms 141.10ms 141.10ms cudaMalloc 
    0.16% 43.648ms   1 43.648ms 43.648ms 43.648ms cudaDeviceReset 
    0.08% 22.182ms  80000  277ns  0ns 121.07us cudaSetupArgument 
    0.06% 15.437ms  40000  385ns  0ns 24.433us cudaGetLastError 
    0.05% 12.929ms  40000  323ns  0ns 57.253us cudaConfigureCall 
    0.00% 1.1932ms  91 13.112us  0ns 734.09us cuDeviceGetAttribute 
    0.00% 762.17us   1 762.17us 762.17us 762.17us cudaFree 
    0.00% 359.93us   1 359.93us 359.93us 359.93us cuDeviceGetName 
    0.00% 8.3880us   1 8.3880us 8.3880us 8.3880us cuDeviceTotalMem 
    0.00% 2.5520us   3  850ns  364ns 1.8230us cuDeviceGetCount 
    0.00% 1.8240us   3  608ns  365ns 1.0940us cuDeviceGet 

Results from CUDA Samples\v8.0\1_Utilities\bandwidthTest:

[CUDA Bandwidth Test] - Starting... 
Running on... 

Device 0: GeForce GTX 1050 Ti 
Quick Mode 

Host to Device Bandwidth, 1 Device(s) 
PINNED Memory Transfers 
    Transfer Size (Bytes)  Bandwidth(MB/s) 
    33554432      11038.4 

Device to Host Bandwidth, 1 Device(s) 
PINNED Memory Transfers 
    Transfer Size (Bytes)  Bandwidth(MB/s) 
    33554432      11469.6 

Device to Device Bandwidth, 1 Device(s) 
PINNED Memory Transfers 
    Transfer Size (Bytes)  Bandwidth(MB/s) 
    33554432      95214.0 

Result = PASS 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled. 
Comments:

Run the 'bandwidthTest' in the Samples/1_Utilities folder to get an estimate of what your card can actually do. The code is also not hard to understand and will give you some pointers. – zindarod

You are probably hitting one of the caches, which means you perceive a higher bandwidth. But the metrics nvprof provides will likely give you a better measure than anything you might try to compute yourself. [This](https://stackoverflow.com/questions/37732735/nvprof-option-for-bandwidth/37740119#37740119) may be of interest. –
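
For example, the profiler can report measured DRAM throughput directly; the run below is only a sketch, and the metric names are assumed to be available on this device:

nvprof --metrics dram_read_throughput,dram_write_throughput ./cuda_test.exe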

Are you building a debug or a release project? For a release build, your 'get_mem_kernel' does nothing with the data being read that affects global state, so the compiler is free to optimize the actual loads away. You can confirm this by looking at the kernel disassembly, or by asking the profiler for the bandwidth actually achieved. –
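
For example, the SASS of the release build can be inspected with the toolkit's cuobjdump to see whether the load is still there (a usage sketch against the executable from the question):

cuobjdump -sass cuda_test.exe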

Answer:

The compiler was optimizing away the memory read, as pointed out by Robert Crovella. Thanks for the help; I would never have guessed it.

In detail:
My compiler was optimizing away the val variable and, by extension, the memory read itself.
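
For completeness, a minimal sketch of one way to keep the load from being eliminated (an illustration, not the code above): give the read an observable side effect, for example by storing the loaded value into an output buffer.

// Sketch only: a variant of get_mem_kernel in which the load cannot be removed,
// because the loaded value is written back out. Note that the extra store means
// the kernel now moves roughly twice the data of a pure read.
__global__
void get_mem_kernel_visible(const unsigned int size,
                            const data_type * const in_data,
                            data_type * const out_data)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size)
    {
        out_data[idx] = in_data[idx];
    }
}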