我用Python比較了theano(CPU),theano(GPU)和Scikit-learn(CPU)的處理時間。 但是,我得到了奇怪的結果。 這裏看看我繪製的圖。爲什麼在GPU上Skyline比Theano更快?
處理時間比較:
你可以看到的結果scikit學習比theano(GPU)更快。 我檢查它的經過時間的程序是從一個有n * 40個元素的矩陣計算歐幾里德距離矩陣。
這是代碼的一部分。
points = T.fmatrix("points")
edm = T.zeros_like(points)
def get_point_to_points_euclidean_distances(point_id):
euclideans = (T.sqrt((T.sqr(points- points[point_id, : ])).sum(axis=1)))
return euclideans
def get_EDM_CPU(points):
EDM = np.zeros((points.shape[0], points.shape[0])).astype(np.float32)
for row in range(points.shape[0]):
EDM[row, :] = np.sqrt(np.sum((points - points[row, :])**2, axis=1))
return EDM
def get_sk(points):
EDM = sk.pairwise_distances(a, metric='l2')
return EDM
seq = T.arange(T.shape(points)[0])
(result, _) = theano.scan(fn = get_point_to_points_euclidean_distances, \
outputs_info = None , \
sequences = seq)
get_EDM_GPU = theano.function(inputs = [points], outputs = result, allow_input_downcast = True)
我認爲GPU比sci-kit學習慢的原因可能是轉換時間。所以我用nvprof命令分析了GPU。然後我得到了這個。
==27105== NVPROF is profiling process 27105, command: python ./EDM_test.py
Using gpu device 0: GeForce GTX 580 (CNMeM is disabled, cuDNN not available)
data shape : (10000, 40)
get_EDM_GPU elapsed time : 1.84863090515 (s)
get_EDM_CPU elapsed time : 8.09937691689 (s)
get_EDM_sk elapsed time : 1.10968112946 (s)
ratio : 4.38128395145
==27105== Profiling application: python ./EDM_test.py
==27105== Warning: Found 9 invalid records in the result.
==27105== Warning: This could be because device ran out of memory when profiling.
==27105== Profiling result:
Time(%) Time Calls Avg Min Max Name
71.34% 1.28028s 9998 128.05us 127.65us 128.78us kernel_reduce_01_node_316e2e1cbfbe8cfb8e4a101f329ffeec_0(int, int, float const *, int, int, float*, int)
19.95% 357.97ms 9997 35.807us 35.068us 36.948us kernel_Sub_node_bc41b3f8f12c93d29f2c4360ad445d80_0_2(unsigned int, int, int, float const *, int, int, float const *, int, int, float*, int, int)
7.32% 131.38ms 2 65.690ms 1.2480us 131.38ms [CUDA memcpy DtoH]
1.25% 22.456ms 9996 2.2460us 2.1140us 2.8420us kernel_Sqrt_node_23508f8f49d12f3e8369d543f5620c15_0_Ccontiguous(unsigned int, float const *, float*)
0.12% 2.1847ms 1 2.1847ms 2.1847ms 2.1847ms [CUDA memset]
0.01% 259.73us 5 51.946us 640ns 250.36us [CUDA memcpy HtoD]
0.00% 17.086us 1 17.086us 17.086us 17.086us kernel_reduce_ccontig_node_97496c4d3cf9a06dc4082cc141f918d2_0(unsigned int, float const *, float*)
0.00% 2.0090us 1 2.0090us 2.0090us 2.0090us void copy_kernel<float, int=0>(cublasCopyParams<float>)
轉印[CUDA的memcpy DtoH]進行兩次{ 1.248 [us], 131.38 [ms] }
[CUDA的memcpy HtoD]進行5倍{ min: 640 [ns], max: 250.36 [us] }
傳送時間爲約131.639毫秒(131.88毫秒+ 259.73我們的轉印)。但是GPU和scikit-learn之間的差距大約是700ms(1.8s - 1.1s)所以差距超過了傳輸時間。
它是否僅從對稱矩陣計算上三角矩陣?
什麼讓scikit學習如此之快?
第一張圖上的CPU和GPU是什麼? – Worthy7
@ Worthy7 CPU意味着它使用for-loop語句計算歐幾里得距離矩陣,而GPU意味着它通過使用帶有GPU的theano庫來計算矩陣。 – Holden
不,但CPU是一塊硬件,SKLEARN是一個python框架。你不能只把這兩個放在圖上。無論如何,我認爲你的意思是運行純python vs使用sklearn。 sklearn內部優化 - 就是這麼簡單。 請嘗試更大的數據集:)讓我們假設100X更大 – Worthy7