2017-01-23
2

Tensorflow: Graph is finalized and cannot be modified

I am trying to introduce fault tolerance into my program by saving variables through checkpoints. I am trying to achieve this using the MonitoredTrainingSession function. Below is my configuration:

import tensorflow as tf 

global_step = tf.Variable(10, trainable=False, name='global_step') 
x = tf.constant(2) 

with tf.device("/job:local/task:0"): 
    y1 = tf.Variable(x + 300) 

with tf.device("/job:local/task:1"): 
    y2 = tf.Variable(x**2) 

with tf.device("/job:local/task:2"): 
    y3 = tf.Variable(5*x) 

with tf.device("/job:local/task:3"): 
    y0 = tf.Variable(x - 66) 
    y = y0 + y1 + y2 + y3 

model = tf.global_variables_initializer() 
saver = tf.train.Saver(sharded=True) 

chief = tf.train.ChiefSessionCreator(scaffold=None, master='grpc://localhost:2222', config=None, checkpoint_dir='/home/tensorflow/codes/checkpoints') 
summary_hook = tf.train.SummarySaverHook(save_steps=None, save_secs=10, output_dir='/home/tensorflow/codes/savepoints', summary_writer=None, scaffold=None, summary_op=tf.summary.tensor_summary(name="y", tensor=y)) 
saver_hook = tf.train.CheckpointSaverHook(checkpoint_dir='/home/tensorflow/codes/checkpoints', save_secs=None, save_steps=True, saver=saver, checkpoint_basename='model.ckpt', scaffold=None) 

# with tf.train.MonitoredSession(session_creator=ChiefSessionCreator,hooks=[saver_hook, summary_hook]) as sess: 

with tf.train.MonitoredTrainingSession(master='grpc://localhost:2222', is_chief=True, checkpoint_dir='/home/tensorflow/codes/checkpoints', 
    scaffold=None, hooks=[saver_hook,summary_hook], chief_only_hooks=None, save_checkpoint_secs=None, save_summaries_steps=True, config=None) as sess: 

    while not sess.should_stop():
        sess.run(tf.global_variables_initializer())

    while not sess.should_stop():
        result = sess.run(y)
        print(result)

I get the following RuntimeError, which I cannot resolve:

Traceback (most recent call last): 
    File "add_1.py", line 39, in <module> 
    sess.run(tf.global_variables_initializer()) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1187, in global_variables_initializer 
    return variables_initializer(global_variables()) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 1169, in variables_initializer 
    return control_flow_ops.group(*[v.initializer for v in var_list], name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2773, in group 
    deps.append(_GroupControlDeps(dev, ops_on_device[dev])) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2721, in _GroupControlDeps 
    return no_op(name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_control_flow_ops.py", line 186, in no_op 
    result = _op_def_lib.apply_op("NoOp", name=name) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op 
    op_def=op_def) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2199, in create_op 
    self._check_not_finalized() 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1925, in _check_not_finalized 
    raise RuntimeError("Graph is finalized and cannot be modified.") 
RuntimeError: Graph is finalized and cannot be modified. 
+0

http://stackoverflow.com/a/43325348/6521116 –

Answers

7

The root cause of your error seems to be that MonitoredTrainingSession has already finalized (frozen) the graph, so your tf.global_variables_initializer() can no longer modify it.

That said, there are a few things worth noting:

1) Why are you trying to initialize all variables repeatedly here?

while not sess.should_stop(): 
    sess.run(tf.global_variables_initializer()) 

2) It looks like some of your code duplicates what MonitoredTrainingSession already does internally, e.g. ChiefSessionCreator. Could you take another look at the code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/monitored_session.py#L243) or search for example usages to see how MonitoredTrainingSession is meant to be used?
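Building on that: the error goes away once every op, including the initializer, is created before the session exists, for example by handing an init op to the session through a Scaffold. A minimal sketch (written against the TF 1.x API; the shim at the top only falls back to tf.compat.v1 so the sketch also runs under TF 2.x):

```python
import tensorflow as tf

# Under TF 2.x the 1.x symbols used here live under tf.compat.v1:
if hasattr(tf, 'compat') and hasattr(tf.compat, 'v1'):
    tf = tf.compat.v1
    tf.disable_eager_execution()

x = tf.constant(2)
y = tf.Variable(x + 300)

# Build the init op BEFORE the session exists; MonitoredTrainingSession
# finalizes the graph, so no ops (initializers included) can be added later.
scaffold = tf.train.Scaffold(init_op=tf.global_variables_initializer())

with tf.train.MonitoredTrainingSession(scaffold=scaffold) as sess:
    result = sess.run(y)  # the scaffold's init op has already run
    print(result)
```

Here result is 302 and no RuntimeError is raised, because nothing tries to touch the graph after it has been finalized.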

+0

Sorry, I am very new to tensorflow, so my code is probably quite rough. 1) I have commented out the while loop above the initialization part, so it only runs once. 2) I am not sure whether I need ChiefSessionCreator even after specifying the config in MonitoredTrainingSession. When I run it, it actually prints 252 in the loop. But when I stop it and run it again, it shows: [http://pastebin.com/Cgk4Z9Pc](http://pastebin.com/Cgk4Z9Pc) – itsamineral

+2

When you run it a second time, it tries to load the checkpoint from your earlier run, which is missing global_step. Take a look at this thread (http://stackoverflow.com/questions/36113090/tensorflow-get-the-global-step-when-restoring-checkpoints) for how to save and restore global_step, and here (https://github.com/tensorflow/tensorflow/blob/b00fc538638f87ac45be9105057b9865f0f9418b/tensorflow/python/training/monitored_session_test.py#L206) for how to initialize one. – guinny
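A sketch of that point: if the graph contains a global_step from the start, a second run restores it from the checkpoint instead of failing. This assumes tf.train.get_or_create_global_step (older 1.x releases exposed it as tf.contrib.framework.get_or_create_global_step); the shim below falls back to tf.compat.v1 under TF 2.x:

```python
import tempfile
import tensorflow as tf

# Under TF 2.x the 1.x symbols used here live under tf.compat.v1:
if hasattr(tf, 'compat') and hasattr(tf.compat, 'v1'):
    tf = tf.compat.v1
    tf.disable_eager_execution()

ckpt_dir = tempfile.mkdtemp()

def build_step_op():
    """Build a fresh graph containing a global_step plus an op that increments it."""
    tf.reset_default_graph()
    gs = tf.train.get_or_create_global_step()
    return tf.assign_add(gs, 1)

# First "run" of the program: a checkpoint (including global_step) is saved on exit.
step = build_step_op()
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir) as sess:
    first = sess.run(step)

# Second "run": global_step is restored from the checkpoint and counting continues.
step = build_step_op()
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir) as sess:
    second = sess.run(step)
```

The second session picks up global_step from the checkpoint, so second is 2 rather than starting over at 1.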

1

If you want to initialize the graph inside a loop, you can use this function to create a new graph at the top of the loop.

import tensorflow as tf 

tf.reset_default_graph()   # discard the current default graph
tf.Graph().as_default()    # note: as_default() returns a context manager; use it in a `with` block
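For illustration, a sketch of that pattern (assuming the TF 1.x API; the shim below falls back to tf.compat.v1 under TF 2.x), rebuilding the graph fresh on each loop iteration:

```python
import tensorflow as tf

# Under TF 2.x the 1.x symbols used here live under tf.compat.v1:
if hasattr(tf, 'compat') and hasattr(tf.compat, 'v1'):
    tf = tf.compat.v1
    tf.disable_eager_execution()

for i in range(3):
    tf.reset_default_graph()          # drop every op from the previous iteration
    v = tf.Variable(i, name='v')
    init = tf.global_variables_initializer()
    with tf.Session() as sess:        # a plain Session is enough for this demo
        sess.run(init)
        last = sess.run(v)

print(last)  # 2 (the value from the final iteration's fresh graph)
```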
0

Since your goal is to use MonitoredTrainingSession to give you checkpointing, the usage is much simpler than in your example:

import tensorflow as tf 

global_step = tf.contrib.framework.get_or_create_global_step() 
x = tf.constant(2) 
y1 = x + 300 
y2 = x**2 
y3 = x * 5 
y0 = x - 66 
y = y0 + y1 + y2 + y3 
step = tf.assign_add(global_step, 1) 

with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/checkpoints') as sess: 
    while not sess.should_stop():
        result, i = sess.run([y, step])
        print(result, i)
  • The hooks that save/restore checkpoints are created for you by MonitoredTrainingSession.
  • If you pass save_checkpoint_secs, you can change the checkpoint frequency from the 10-minute default. I have found that higher frequencies are not worth it: saving a checkpoint is not free, so very frequent checkpointing ends up slowing down training.
  • ChiefSessionCreator and the gRPC config are only needed for distributed execution (see here for a description of these concepts). The same goes for pinning ops to specific devices: make sure you really need to before doing it, since it can slow you down if you are not careful.
  • You don't need to wrap the results of tensor operations in tf.Variable(): they are already tensors.
  • You can use save_summaries_steps to monitor training with tensorboard, but by default that already happens every 100 steps anyway.
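To tie the bullets together, a sketch of how those knobs might be passed (the 60-second interval is only an illustration; this assumes tf.train.get_or_create_global_step, which older 1.x releases exposed as tf.contrib.framework.get_or_create_global_step, and the shim below falls back to tf.compat.v1 under TF 2.x):

```python
import tempfile
import tensorflow as tf

# Under TF 2.x the 1.x symbols used here live under tf.compat.v1:
if hasattr(tf, 'compat') and hasattr(tf.compat, 'v1'):
    tf = tf.compat.v1
    tf.disable_eager_execution()

global_step = tf.train.get_or_create_global_step()
step = tf.assign_add(global_step, 1)

with tf.train.MonitoredTrainingSession(
        checkpoint_dir=tempfile.mkdtemp(),  # checkpoint/summary hooks are created for you
        save_checkpoint_secs=60,            # instead of the 600-second (10-minute) default
        save_summaries_steps=100,           # the default: summaries every 100 steps
        ) as sess:
    for _ in range(3):
        n = sess.run(step)

print(n)  # 3
```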