TensorFlow on shared GPUs: how to automatically select an unused one

I have ssh access to a cluster of n GPUs. TensorFlow automatically names them gpu:0, ..., gpu:(n-1).

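Other people have access too, and they sometimes grab GPUs at random. I have not placed any tf.device() explicitly, because that is cumbersome, and even if I pick GPU number j, I run into problems when someone else is already using GPU number j.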

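I would like to go through the GPUs' usage, find the first one that is unused, and use only that one. I guess one could parse the output of nvidia-smi with bash, get a variable i, and pass that variable i to the TensorFlow script as the number of the GPU to use.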

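I have never seen an example of this, although I imagine it is a fairly common problem. What is the simplest way to do it? Is a pure TensorFlow solution available?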

2017-01-13 jean

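I'm not aware of a pure-TensorFlow solution. The problem is that the existing place for TensorFlow configuration is the Session config. However, the GPU memory pool is shared by all TensorFlow sessions within a process, so the Session config would be the wrong place to add this, and there is no mechanism for process-global configuration (although there should be one, which would also make it possible to configure the process-global Eigen thread pool). So you need to do it at the process level, using the CUDA_VISIBLE_DEVICES environment variable.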

Something like this:

import subprocess, re 

# Nvidia-smi GPU memory parsing. 
# Tested on nvidia-smi 370.23 

def run_command(cmd): 
    """Run command, return output as string.""" 
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0] 
    return output.decode("ascii") 

def list_available_gpus(): 
    """Returns list of available GPU ids.""" 
    output = run_command("nvidia-smi -L") 
    # lines of the form GPU 0: TITAN X 
    gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):") 
    result = [] 
    for line in output.strip().split("\n"): 
        m = gpu_regex.match(line)
        assert m, "Couldn't parse " + line
        result.append(int(m.group("gpu_id")))
    return result 

def gpu_memory_map(): 
    """Returns map of GPU id to memory allocated on that GPU.""" 

    output = run_command("nvidia-smi") 
    gpu_output = output[output.find("GPU Memory"):] 
    # lines of the form 
    # | 0  8734 C python          11705MiB | 
    memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB") 
    # Start every GPU at 0 so that idle GPUs also appear in the map.
    result = {gpu_id: 0 for gpu_id in list_available_gpus()}
    for row in gpu_output.split("\n"):
        m = memory_regex.search(row)
        if not m:
            continue
        gpu_id = int(m.group("gpu_id"))
        gpu_memory = int(m.group("gpu_memory"))
        # Several processes may share one GPU, so sum their allocations.
        result[gpu_id] += gpu_memory
    return result 

def pick_gpu_lowest_memory(): 
    """Returns GPU with the least allocated memory""" 

    memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()] 
    best_memory, best_gpu = sorted(memory_gpu_map)[0] 
    return best_gpu

You can then put it in utils.py and set the GPU in your TensorFlow script before the first tensorflow import, i.e.

import utils 
import os 
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory()) 
import tensorflow
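
The regex above depends on the layout of the human-readable nvidia-smi table, which was tested on 370.23 and may differ in other driver versions. As a hedged alternative, here is a minimal sketch that asks nvidia-smi for machine-readable CSV through its --query-gpu and --format options. The names parse_memory_csv, query_gpu_memory and least_used_gpu are made up for this illustration, and the sketch assumes every GPU reports memory.used as a plain integer number of MiB (a GPU that reports something like "[N/A]" would need extra handling).

```python
import subprocess

# One line per GPU, e.g. "0, 11705" (index, used memory in MiB).
QUERY_CMD = ["nvidia-smi", "--query-gpu=index,memory.used",
             "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse lines like '0, 11705' into {gpu_id: used_memory_in_MiB}."""
    result = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        gpu_id, used = (field.strip() for field in line.split(","))
        result[int(gpu_id)] = int(used)
    return result


def query_gpu_memory():
    """Run nvidia-smi and return {gpu_id: used_memory_in_MiB}."""
    output = subprocess.check_output(QUERY_CMD)
    return parse_memory_csv(output.decode("utf-8"))


def least_used_gpu(memory_map):
    """Return the id of the GPU with the least used memory.

    Ties are broken in favour of the lowest GPU id.
    """
    return min(memory_map, key=lambda gpu_id: (memory_map[gpu_id], gpu_id))
```

It would be used the same way as the version above: set os.environ["CUDA_VISIBLE_DEVICES"] = str(least_used_gpu(query_gpu_memory())) before the first tensorflow import. Like the original, this only picks the GPU that looks least busy at that moment; another user can still claim it right afterwards.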


2017-01-13 16:02:43

Thanks for the brilliant answer! – jean

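Apparently 'nvidia-smi' can report device numbers that don't match in some cases, so you have to combine it with 'lspci' to get the correct numbering, as explained in issue 152 (https://github.com/tensorflow/tensorflow/issues/152#issuecomment-273555972)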
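
A possible way to sidestep the mismatch, which is not necessarily what the linked comment describes: by default CUDA numbers devices fastest-first, while nvidia-smi lists them in PCI bus order, and setting the CUDA_DEVICE_ORDER environment variable to PCI_BUS_ID makes CUDA use the same order. The helper name pin_gpu below is made up for this sketch.

```python
import os


def pin_gpu(gpu_id, env=os.environ):
    """Expose only gpu_id to CUDA, numbered in PCI bus order like nvidia-smi."""
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env
```

As with CUDA_VISIBLE_DEVICES alone, this has to run before the first tensorflow import to have any effect.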

I'll check that out, thanks! But so far your solution seems to be working fine for me! – jean
