Stop TensorFlow from running on the GPU after computation

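I am running a REST server in Python that exposes an endpoint which receives an image and uses a TensorFlow model to predict what is in that image. After starting the server, I send images to the REST endpoint. The loaded model is an Inception model that I trained myself. It is loaded from a TensorFlow checkpoint file to restore the weights. Here is the function that builds the graph and performs the classification: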

import os 
import tensorflow as tf 

from cnn_server.server import file_service as dirs 
from slim.datasets import dataset_utils 
from slim.nets import nets_factory as network_factory 
from slim.preprocessing import preprocessing_factory as preprocessing_factory 

def inference_on_image(bot_id, image_file, network_name='inception_v4', return_labels=1): 

     model_path = dirs.get_model_data_dir(bot_id) 

     # Get number of classes to predict 
     protobuf_dir = dirs.get_protobuf_dir(bot_id) 
     number_of_classes = dataset_utils.get_number_of_classes_by_labels(protobuf_dir) 

     # Get the preprocessing and network construction functions 
     preprocessing_fn = preprocessing_factory.get_preprocessing(network_name, is_training=False) 
     network_fn = network_factory.get_network_fn(network_name, number_of_classes) 

     # Read the temporary image file and decode it into a Tensor of shape [height, width, channels] 
     image_tensor = tf.gfile.FastGFile(image_file, 'rb').read() 
     image_tensor = tf.image.decode_image(image_tensor, channels=0) 

     # Perform preprocessing and reshape into [network.default_width, network.default_height, channels] 
     network_default_size = network_fn.default_image_size 
     image_tensor = preprocessing_fn(image_tensor, network_default_size, network_default_size) 

     # Create an input batch of size one from the preprocessed image 
     input_batch = tf.reshape(image_tensor, [1, 299, 299, 3]) 

     # Create the network up to the Predictions Endpoint 
     logits, endpoints = network_fn(input_batch) 

     restorer = tf.train.Saver() 

     with tf.Session() as sess: 
      tf.global_variables_initializer().run() 

      # Restore the variables of the network from the last checkpoint and run the graph 
      restorer.restore(sess, tf.train.latest_checkpoint(model_path)) 
      sess.run(endpoints) 

      # Get the NumPy array of predictions out of the 'Predictions' endpoint 
      predictions = endpoints['Predictions'].eval()[0] 
      sess.close() 

     return map_predictions_to_labels(protobuf_dir, predictions, return_labels)

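To build the graph of the Inception v4 model I use tf.model.slim, a collection of TensorFlow implementations of state-of-the-art CNNs. The Inception model is built here: https://github.com/tensorflow/models/blob/master/slim/nets/inception_v4.py and is provided through a factory method: https://github.com/tensorflow/models/blob/master/slim/nets/nets_factory.py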

For the first image, everything works as expected:

2017-07-17 18:00:43.831365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-07-17 18:00:43.831371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y 
2017-07-17 18:00:43.831384: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0) 
192.168.0.192 - - [17/Jul/2017 18:00:46] "POST /classify/4 HTTP/1.1" 200 -

The second image produces the following error:

ValueError: Variable InceptionV4/Conv2d_1a_3x3/weights already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

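My understanding is that the graph is created on the first call and then kept around somewhere. Sending a second image calls the function again, which tries to recreate the existing graph and raises the error. So far I have tried a few things: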

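Stopping TensorFlow altogether: I tried to stop TensorFlow altogether and recreate the device on the GPU each time. This would be the best solution, because the GPU would then not be occupied by TensorFlow while the server is running. I tried to do this with sess.close(), but it did not work. nvidia-smi still shows the process on the GPU after the first image has been processed. I then tried to access the device somehow, but all I could get was the list of available devices through device_lib.list_local_devices(). That did not lead to any option for manipulating the TensorFlow process on the GPU. Stopping the server, i.e. the initial Python script that started the TensorFlow session, also removes TensorFlow from the GPU. Restarting the server after every classification is not a nice solution.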

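Resetting or deleting the graph: I tried to reset the graph in several ways. One way was to retrieve the graph that the tensors I run belong to, iterate over all of its collections, and clear them: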

graph = endpoints['Predictions'].graph 
for key in graph.get_all_collection_keys(): 
    graph.clear_collection(key)

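Debugging showed that the graph collections are empty afterwards, but the error remains the same. Another way was to set the graph of the endpoints as the default graph using with graph.as_default():. Since that happens before the graph is created, I did not have much hope that it would delete the graph after the computation. It did not.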

Setting the variable scope reuse=True: The variable scope has a reuse option, which can be set in inception_v4.py.

def inception_v4(inputs, num_classes=1001, is_training=True, 
       dropout_keep_prob=0.8, 
       reuse=None, 
       scope='InceptionV4', 
       create_aux_logits=True):

Setting it to True causes an error when the graph is first created, saying that the variables do not exist.

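Loading the model once, then reusing it: Another approach I thought of is to create the model once and then just reuse it, i.e. avoid calling the network factory again. This is problematic because the server holds several models, each working on a different number of classes. That means I would have to create the graph for each of these models, keep them alive, and maintain them somehow. While this is possible, it causes a lot of overhead and is somewhat redundant, because the model is always the same and only the weights and the last layer differ. The weights are already stored in checkpoint files, and the implementation in tf.model.slim makes it easy to create graphs with different numbers of output classes.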

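I am out of ideas here. The ideal solution would of course be to shut down TensorFlow on the GPU completely and recreate the device every time the function is called.

I hope someone can help me out here.

Thanks in advance.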

Source

2017-07-17 molig

I found a solution to the problem here: https://stackoverflow.com/a/44842044/7208993

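The idea is to execute the function in a separate process, which terminates once the function has run. The result can be preserved by sharing a variable through a Manager() object. While this may not be the most elegant solution, TensorFlow does not seem to offer a better way at the moment. Since the GPU is then no longer occupied by TensorFlow for the whole time the server is running, this is good enough. The code now looks like this: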

import multiprocessing 
from multiprocessing import Manager 

def inference_on_image(bot_id, image_file, network_name='inception_v4', return_labels=1): 
     # Run the inference in a child process so that TensorFlow's GPU state is released when it exits 
     manager = Manager() 
     prediction_dict = manager.dict() 
     process = multiprocessing.Process(target=infere, args=(bot_id, image_file, network_name, return_labels, prediction_dict)) 
     process.start() 
     process.join() 
     return prediction_dict['predictions'] 


def infere(bot_id, image_file, network_name='inception_v4', return_labels=1, prediction_dict=None): 
     # Get the model path 
     model_path = dirs.get_model_data_dir(bot_id) 

     # Get number of classes to predict 
     protobuf_dir = dirs.get_protobuf_dir(bot_id) 
     number_of_classes = dataset_utils.get_number_of_classes_by_labels(protobuf_dir) 

     # Get the preprocessing and network construction functions 
     preprocessing_fn = preprocessing_factory.get_preprocessing(network_name, is_training=False) 
     network_fn = network_factory.get_network_fn(network_name, number_of_classes) 

     # Read the temporary image file and decode it into a Tensor of shape [height, width, channels] 
     image_tensor = tf.gfile.FastGFile(image_file, 'rb').read() 
     image_tensor = tf.image.decode_image(image_tensor, channels=0) 

     # Perform preprocessing and reshape into [network.default_width, network.default_height, channels] 
     network_default_size = network_fn.default_image_size 
     image_tensor = preprocessing_fn(image_tensor, network_default_size, network_default_size) 

     # Create an input batch of size one from the preprocessed image 
     input_batch = tf.reshape(image_tensor, [1, 299, 299, 3]) 

     # Create the network up to the Predictions Endpoint 
     logits, endpoints = network_fn(input_batch) 

     restorer = tf.train.Saver() 

     with tf.Session() as sess: 
      tf.global_variables_initializer().run() 

      # Restore the variables of the network from the last checkpoint and run the graph 
      restorer.restore(sess, tf.train.latest_checkpoint(model_path)) 
      sess.run(endpoints) 

      # Get the NumPy array of predictions out of the 'Predictions' endpoint 
      predictions = endpoints['Predictions'].eval()[0] 
      sess.close() 
      graph = endpoints['Predictions'].graph 

      prediction_dict['predictions'] = map_predictions_to_labels(protobuf_dir, predictions, return_labels)
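
The process-isolation pattern used above can be reduced to a small, TensorFlow-free sketch. This is an illustration under stated assumptions, not the original server code: `run_isolated`, `_child_entry`, and `fake_inference` are hypothetical names, and the sketch assumes a POSIX host where the "fork" start method is available. Unlike the code above, it also checks the child's exit code, so a crash in the worker surfaces as an error rather than a missing dictionary key.

```python
import multiprocessing


def _child_entry(fn, args, shared):
    # Runs inside the child process. Any framework or GPU state that fn
    # creates is released when this process exits.
    shared["result"] = fn(*args)


def run_isolated(fn, *args):
    """Run fn(*args) in a short-lived child process and return its result."""
    # Assumption: the "fork" start method exists (POSIX). With "spawn",
    # fn must be importable and callers need an `if __name__ == "__main__"` guard.
    ctx = multiprocessing.get_context("fork")
    with ctx.Manager() as manager:
        shared = manager.dict()
        proc = ctx.Process(target=_child_entry, args=(fn, args, shared))
        proc.start()
        proc.join()
        if proc.exitcode != 0 or "result" not in shared:
            raise RuntimeError("worker exited with code %s" % proc.exitcode)
        return shared["result"]


def fake_inference(image_file, return_labels):
    # Hypothetical stand-in for the TensorFlow code in infere().
    return [("label_%d" % i, 1.0 / (i + 1)) for i in range(return_labels)]


if __name__ == "__main__":
    print(run_isolated(fake_inference, "cat.jpg", 2))
```

In the server, `fake_inference` would be replaced by a function that builds the graph, restores the checkpoint, and returns the mapped labels. The parent process should never create a TensorFlow session itself, otherwise it would hold on to the GPU again.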

Source

2017-07-17 17:29:05 molig

Let's go through your questions one by one.

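First, the error about the variable already existing comes from reusing the existing graph while rerunning the model-creation code on every request. You can create one graph per request by adding a with tf.Graph().as_default(): context manager inside the inference_on_image function, or (strongly recommended) reuse the graph by separating the part of the function that calls session.run on the network from the model building and weight loading.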
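
The recommended split between building a model and running it can be sketched as a small cache keyed by bot_id. This is a minimal illustration, not code from the question: `ModelCache`, `build_fn`, and `fake_build` are hypothetical names, and the TensorFlow-specific work is replaced by a stand-in so the structure is visible on its own.

```python
import threading


class ModelCache:
    """Build each bot's model once, then reuse it for every request."""

    def __init__(self, build_fn):
        # build_fn(bot_id) is a hypothetical callable. In the real server it
        # would create a tf.Graph, build the network, restore the checkpoint,
        # open a session, and return a predict(image_file) function.
        self._build_fn = build_fn
        self._models = {}
        self._lock = threading.Lock()

    def predict(self, bot_id, image_file):
        with self._lock:  # build at most once per bot_id
            if bot_id not in self._models:
                self._models[bot_id] = self._build_fn(bot_id)
            model = self._models[bot_id]
        return model(image_file)


build_calls = []


def fake_build(bot_id):
    # Stand-in for the expensive graph construction and weight loading.
    build_calls.append(bot_id)
    return lambda image_file: "bot %s saw %s" % (bot_id, image_file)


cache = ModelCache(fake_build)
cache.predict(4, "a.jpg")
cache.predict(4, "b.jpg")  # reuses the model that was built for bot 4
```

The trade-off is the one raised in the question: every cached model stays in memory for as long as the server runs.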

For the second question, there is no way to make TensorFlow reset its GPU state without killing the whole process.

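For the third question, clearing the graph collections won't do much. You can use a new graph for each request, but by default it will still share the state of the variables, since they reside on the GPU. You can use session.reset to clear that state, but this will not give you the memory back.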

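To reuse the model for different numbers of classes while sharing weights, it sounds like you need a single function that constructs all of them. I think the best way to do this is to change the implementation of the slim method so that it only goes up to the last layer, and then have your own code add a fully connected layer on top with the appropriate number of classes.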

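Of course, unless you train all the models together, you will probably still need different parameter values for the rest of the network.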
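
The point that GPU memory is only returned when the process exits can be checked from Python instead of reading nvidia-smi output by eye. The following is a rough sketch: `parse_compute_apps` and `gpu_processes` are hypothetical helper names, and it assumes that nvidia-smi is on the PATH and supports the --query-compute-apps option.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, used_memory` rows into a {pid: MiB or None} dict."""
    usage = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        pid_text, mem_text = [field.strip() for field in line.split(",", 1)]
        # Memory can be reported as a non-numeric placeholder on some setups.
        usage[int(pid_text)] = int(mem_text) if mem_text.isdigit() else None
    return usage


def gpu_processes():
    """Ask nvidia-smi which processes currently hold GPU memory."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout
    return parse_compute_apps(out)
```

Calling `os.getpid() in gpu_processes()` from the server after a request would then show whether the server process itself still holds GPU memory.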

Source

2017-07-17 20:39:24

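Thanks for the reply. Reusing the graph means keeping several rather large Inception v4 models in memory. That does not seem like a good idea to me, especially since TF doesn't seem to offer any way to delete a graph. On top of that, the problem remains that TensorFlow keeps occupying the GPU. With the standard configuration, other processes cannot do anything on the GPU in the meantime, which is not ideal either. I chose to start the function in a dedicated process, which terminates after inference (see my answer). That works best for me. – molig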
