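TensorFlow: simultaneous prediction on GPU and CPU

I'm working with TensorFlow and I want to speed up the prediction phase of a pre-trained Keras model (I'm not interested in the training phase) by using the CPU and one GPU simultaneously.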

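I tried to create two different threads that feed two different TensorFlow sessions (one running on the CPU, the other on the GPU). Each thread feeds a fixed number of batches in a loop (for example, out of 100 batches in total I want to assign 20 to the CPU and 80 to the GPU, or any other combination of the two), and the results are then combined. It would be better if the split were done automatically.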
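One way to get an automatic split is to stop fixing the ratio up front: put every batch on a shared queue and let one worker thread per device pull the next batch whenever it is free, so the faster device ends up taking more. The following is a minimal sketch of that idea using only the standard library. `run_with_work_queue` and the `predict_fns` callables are hypothetical names, and each callable stands in for something like `session.run(tensor, feed_dict={x: batch})`.

```python
import queue
import threading


def run_with_work_queue(batches, predict_fns):
    """Run every batch through whichever worker is free next.

    predict_fns maps a label (e.g. 'cpu', 'gpu') to a callable that takes
    one batch. These callables are placeholders for the real per-device
    prediction call; they are an assumption of this sketch.
    Returns (results in original batch order, batches handled per label).
    """
    tasks = queue.Queue()
    for index, batch in enumerate(batches):
        tasks.put((index, batch))

    results = [None] * len(batches)
    counts = {label: 0 for label in predict_fns}

    def worker(label, predict):
        while True:
            try:
                index, batch = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            results[index] = predict(batch)  # each index is written once
            counts[label] += 1  # only this worker touches counts[label]

    threads = [threading.Thread(target=worker, args=item)
               for item in predict_fns.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, counts
```

Because results are stored by batch index, they come back in the original order regardless of which device handled them. One caveat of this design: if the slower device picks up the last batch, the faster one sits idle until it finishes, so the tail of the run can cost some time.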

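However, even in this scenario the batches seem to be fed synchronously: even when I send only a few batches to the CPU and compute all the others on the GPU (with the GPU as the bottleneck), the overall prediction time is always higher than in the test that uses only the GPU.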

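I expected it to be faster because when only the GPU is working, CPU usage is about 20-30%, so there is some CPU capacity available to speed up the computation.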

I have read a lot of discussions, but they all deal with parallelism across multiple GPUs, not between a GPU and a CPU.

Here is a sample of the code I have written. The prediction is done as follows:

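import tensorflow as tf
from threading import Thread


def predict_on_device(session, predict_tensor, batches):
    # x is the shared input placeholder, defined elsewhere (not shown here).
    # Note that the value returned by session.run is discarded.
    for batch in batches:
        session.run(predict_tensor, feed_dict={x: batch})


def split_cpu_gpu(batches, num_batches_cpu, tensor_cpu, tensor_gpu):
    # Caution: running the initializer in a fresh session likely discards
    # the weights that Keras loaded into its own backend session.
    session1 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session1.run(tf.global_variables_initializer())
    session2 = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    session2.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    # The first num_batches_cpu batches go to the CPU, the rest to the GPU.
    t_cpu = Thread(target=predict_on_device, args=(session1, tensor_cpu, batches[:num_batches_cpu]))
    t_gpu = Thread(target=predict_on_device, args=(session2, tensor_gpu, batches[num_batches_cpu:]))

    t_cpu.start()
    t_gpu.start()

    # Wait for both threads to finish.
    coord.join([t_cpu, t_gpu])

    session1.close()
    session2.close()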

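The tensor_cpu and tensor_gpu objects are loaded from the same Keras model in this way:

from keras.models import load_model

# x is the input placeholder shared by both copies (not shown here).
with tf.device('/gpu:0'):
    model_gpu = load_model('model1.h5')
    tensor_gpu = model_gpu(x)

with tf.device('/cpu:0'):
    model_cpu = load_model('model1.h5')
    tensor_cpu = model_cpu(x)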

How can I achieve this CPU/GPU parallelism? I think I'm missing something.

Any kind of help would be much appreciated!

Source

2017-05-30 battuzz

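Did I answer your question? – MaxB

Yes, yes, yes! Sorry for the late reply, I was busy with another project and didn't have time to try it out. I checked your code... could it be that the only reason mine was not working is the intra_op_parallelism_threads option? – battuzz

Any idea how to let TensorFlow find the right number of batches to feed to the CPU and the GPU, so that I can minimize the overall prediction time? – battuzz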

Here is my code, which demonstrates how CPU and GPU execution can be done in parallel:

import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 8

# The GPU input has 16 times as many rows as the CPU input.
data_cpu = np.random.uniform(size=[n//16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n, n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    # Chain of 8 matrix multiplications by the same weight matrix.
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x: data})


with tf.Session(config=tf.ConfigProto(log_device_placement=True, intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    coord = tf.train.Coordinator()

    threads = []

    # comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()

    for t in threads:
        t.start()

    coord.join(threads)

    t1 = time()


print(t1 - t0)

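The timing results are:

CPU thread only: 4-5 s (this will vary by machine, of course).
GPU thread only: 5 s (it does 16x as much work).
Both at the same time: 5 s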
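These timings also give a rough answer to how many batches each device should get. If each device's throughput is taken as work divided by time, splitting in proportion to throughput makes both devices finish at about the same moment. The sketch below works through that arithmetic with the figures above, taking 4.5 s as the midpoint of the CPU range. `proportional_split` is a hypothetical helper, and the calculation assumes that throughput stays constant and that the two devices do not slow each other down, which may not hold in practice.

```python
def proportional_split(total_batches, rate_cpu, rate_gpu):
    """Split batches in proportion to measured throughput (work per second)."""
    share_cpu = rate_cpu / (rate_cpu + rate_gpu)
    batches_cpu = int(round(total_batches * share_cpu))
    return batches_cpu, total_batches - batches_cpu


# Figures from the timings above, counting the GPU workload as 1 unit of work.
rate_cpu = (1.0 / 16) / 4.5   # CPU did 1/16 of the work in about 4.5 s
rate_gpu = 1.0 / 5.0          # GPU did the full workload in 5 s

batches_cpu, batches_gpu = proportional_split(100, rate_cpu, rate_gpu)
gpu_only_time = 1.0 / rate_gpu                # time for 1 unit on the GPU alone
combined_time = 1.0 / (rate_cpu + rate_gpu)   # ideal time with both devices

print(batches_cpu, batches_gpu)
print(round(gpu_only_time, 2), round(combined_time, 2))
```

With these numbers the CPU would take about 6 of every 100 batches, and the ideal time for the GPU's workload drops from 5 s to roughly 4.7 s. The best-case gain on this machine is therefore small, and any contention between the devices can easily cancel it.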

Note that there was no need to have 2 sessions (although that worked for me as well).

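Reasons you might be seeing different results include:

some contention for system resources (GPU execution does consume some host system resources, and if running the CPU thread crowds it out, this could worsen performance)
incorrect timing
part(s) of your model can only run on the GPU or only on the CPU
a bottleneck elsewhere
some other problem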
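Of these, incorrect timing is the easiest to rule out. A minimal sketch of a timing helper follows, assuming the work to be measured is wrapped in a zero-argument callable. `time_call` is a hypothetical helper, not part of TensorFlow. It makes untimed warm-up calls first, since the first `session.run` typically includes one-off setup cost, and then reports the best of several timed repetitions. It uses `time.perf_counter`, so it needs Python 3.

```python
import time


def time_call(fn, warmup=1, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off setup cost
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload; in the script above this would be a function that
# starts the threads and joins them.
elapsed = time_call(lambda: sum(i * i for i in range(100000)))
print("best of 3: %.4f s" % elapsed)
```

Timing the CPU-only, GPU-only and combined runs with the same helper keeps the three numbers comparable.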

Source

2017-05-30 20:56:09 MaxB
