Memory leak in TensorFlow Google Cloud ML Training

I have been trying out the TensorFlow tutorial scripts on Google Cloud ML. In particular, I used the cifar10 CNN tutorial scripts at https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.

When I run this training script in Google Cloud ML, memory leaks at a rate of about 0.5% per hour.
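A leak rate like this can be checked by sampling the resident set size (RSS) of the training process at intervals. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/status is available. The helper names (parse_vm_rss_kb, current_rss_kb, growth_percent_per_hour) are hypothetical and not part of TensorFlow or Cloud ML, and the growth is expressed relative to the first sample, which may differ from how the 0.5% figure above was measured.

```python
def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def current_rss_kb(pid="self"):
    """Read the resident set size of a process in kB (Linux only)."""
    with open("/proc/{}/status".format(pid)) as status_file:
        return parse_vm_rss_kb(status_file.read())


def growth_percent_per_hour(first_kb, last_kb, elapsed_seconds):
    """Average RSS growth per hour, as a percentage of the first sample."""
    hours = elapsed_seconds / 3600.0
    return (last_kb - first_kb) * 100.0 / first_kb / hours
```

Calling current_rss_kb() once at the start of training and again some hours later, then passing both readings to growth_percent_per_hour, gives a rough hourly rate. A single pair of samples is noisy, so several readings taken after the input queues have filled are more trustworthy.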

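I made no changes to the scripts other than packaging them into the format GCP requires (as described at https://cloud.google.com/ml-engine/docs/how-tos/packaging-trainer) and setting the data location to the storage bucket containing the .bin data files.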

If I run locally, i.e. not in Google Cloud, and use TCMALLOC by setting LD_PRELOAD="/usr/lib/libtcmalloc.so", the memory leak is resolved. However, I do not have this option with Google Cloud ML.
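For local runs, the LD_PRELOAD setting can be applied to a child process only, so the rest of the shell session is unaffected. This is a minimal sketch, not the asker's actual setup. The library path is the one quoted above and varies by distribution, the module name is taken from the gcloud command in this question, and env_with_preload and run_trainer_locally are hypothetical helper names.

```python
import os
import subprocess
import sys

# Path quoted in the question; the real location depends on the distribution.
TCMALLOC_PATH = "/usr/lib/libtcmalloc.so"


def env_with_preload(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD takes a list of libraries separated by colons or spaces.
    env["LD_PRELOAD"] = library_path if not existing else library_path + ":" + existing
    return env


def run_trainer_locally(module="trainer.cifar10_multi_gpu_train", extra_args=()):
    """Run the trainer module in a child process that preloads tcmalloc."""
    if not os.path.exists(TCMALLOC_PATH):
        raise FileNotFoundError(TCMALLOC_PATH)
    command = [sys.executable, "-m", module] + list(extra_args)
    return subprocess.call(command, env=env_with_preload(TCMALLOC_PATH))
```

For example, run_trainer_locally(extra_args=["--num_gpus=4"]) would start the trainer with tcmalloc preloaded and return its exit code. This only helps on a machine you control; it does not change the allocator used by a Cloud ML job.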

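What could be causing the leak, and what can I do to fix it? Why are other users not noticing the same problem? Although the leak is small, it is large enough that my training sessions run out of memory and fail when I run them on my own data for several days. The leak occurs regardless of how many GPUs I use.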
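To see why a small leak matters over a multi-day run, here is a rough estimate of the time until memory is exhausted. It is a sketch under two assumptions that the question does not confirm: the growth is linear, and the 0.5% figure means percentage points of total machine memory per hour. The 40% baseline is a hypothetical value chosen for illustration.

```python
def hours_until_oom(baseline_percent, leak_percent_per_hour):
    """Hours until memory use reaches 100%, assuming linear growth."""
    if leak_percent_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    headroom = 100.0 - baseline_percent
    return headroom / leak_percent_per_hour


# Hypothetical job that starts at 40% memory use and leaks 0.5% per hour.
hours = hours_until_oom(baseline_percent=40.0, leak_percent_per_hour=0.5)
print(hours, hours / 24.0)  # prints: 120.0 5.0
```

Under those assumptions the job would fail after about five days, which is in line with the failures described above for runs lasting several days.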

The gcloud command I used is:

gcloud ml-engine jobs submit training cifar10_job --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.cifar10_multi_gpu_train --region europe-west1 --staging-bucket gs://tfoutput --scale-tier CUSTOM --config config.yml --runtime-version 1.0 -- --num_gpus=4

The configuration file (config.yml) is:

trainingInput: 
    scaleTier: CUSTOM 
    masterType: complex_model_m_gpu

Any help is appreciated, thank you.

Source

2017-06-07 Chris

Can you share the output of running python -c "from google.protobuf.internal import api_implementation; print(api_implementation._default_implementation_type)" locally? Is it 'cpp'? – rhaertel80

@rhaertel80 Yes, it is 'cpp' – Chris

That matches the output in CloudML Engine. We will keep investigating. – rhaertel80

We recommend using this version of the code:

github.com/tensorflow/models/pull/1538

That version has performance benefits (with a shorter run time, you are less likely to hit an OOM).

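Of course, that may not be a permanent fix. However, according to our tests, TensorFlow 1.2 appears to resolve the issue. TensorFlow 1.2 will be available on CloudML Engine soon. If you still have problems, please let us know.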

Source

2017-06-23 23:27:51 rhaertel80
