2017-01-10 98 views

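Retraining Inception on Google Cloud stuck at global step 0

I am following the Google Cloud ML flowers tutorial for retraining Inception. I can run the tutorial, train, and predict just fine.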

I then replaced the flowers dataset with my own test dataset: optical character recognition of digits in images.


My full code is here

Dictionary file: labels

Eval and demo set

Training set

I am running from inside a recent Docker build provided by Google.

docker run -it -p "127.0.0.1:8080:8080" --entrypoint=/bin/bash gcr.io/cloud-datalab/datalab:local-20161227

I can preprocess the files and submit the training job using

# Submit training job.
gcloud beta ml jobs submit training "$JOB_ID" \
    --module-name trainer.task \
    --package-path trainer \
    --staging-bucket "$BUCKET" \
    --region us-central1 \
    -- \
    --output_path "${GCS_PATH}/training" \
    --eval_data_paths "${GCS_PATH}/preproc/eval*" \
    --train_data_paths "${GCS_PATH}/preproc/train*"

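The training job starts, but it never gets past global step 0. The flowers tutorial takes about 1 hour to train on the free tier. I have let my training run for 11 hours. No movement.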


Looking at Stackdriver, nothing is progressing.


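I have also tried a tiny toy dataset of 20 training images and 10 eval images. Same problem.

The GCS bucket ends up looking like this: [image]

Perhaps unsurprisingly, I cannot visualize this log in TensorBoard; nothing is shown.

Full training log: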

INFO 2017-01-10 17:22:00 +0000  unknown_task   Validating job requirements... 
INFO 2017-01-10 17:22:01 +0000  unknown_task   Job creation request has been successfully validated. 
INFO 2017-01-10 17:22:01 +0000  unknown_task   Job MeerkatReader_MeerkatReader_20170110_170701 is queued. 
INFO 2017-01-10 17:22:07 +0000  unknown_task   Waiting for job to be provisioned. 
INFO 2017-01-10 17:22:07 +0000  unknown_task   Waiting for TensorFlow to start. 
INFO 2017-01-10 17:22:10 +0000  master-replica-0    Running task with arguments: --cluster={"master": ["master-d4f6-0:2222"]} --task={"type": "master", "index": 0} --job={ 
INFO 2017-01-10 17:22:10 +0000  master-replica-0     "package_uris": ["gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz"], 
INFO 2017-01-10 17:22:10 +0000  master-replica-0     "python_module": "trainer.task", 
INFO 2017-01-10 17:22:10 +0000  master-replica-0     "args": ["--output_path", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/training", "--eval_data_paths", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/eval*", "--train_data_paths", "gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/train*"], 
INFO 2017-01-10 17:22:10 +0000  master-replica-0     "region": "us-central1" 
INFO 2017-01-10 17:22:10 +0000  master-replica-0    } --beta 
INFO 2017-01-10 17:22:10 +0000  master-replica-0    Downloading the package: gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz 
INFO 2017-01-10 17:22:10 +0000  master-replica-0    Running command: gsutil -q cp gs://api-project-773889352370-ml/MeerkatReader_MeerkatReader_20170110_170701/f78d90a60f615a2d108d06557818eb4f82ffa94a/trainer-0.1.tar.gz trainer-0.1.tar.gz 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    Building wheels for collected packages: trainer 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    creating '/tmp/tmpSgdSzOpip-wheel-/trainer-0.1-cp27-none-any.whl' and adding '.' to it 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    adding 'trainer/model.py' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    adding 'trainer/util.py' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    adding 'trainer/preprocess.py' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    adding 'trainer/task.py' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    adding 'trainer-0.1.dist-info/metadata.json' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    adding 'trainer-0.1.dist-info/WHEEL' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    adding 'trainer-0.1.dist-info/METADATA' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0     Running setup.py bdist_wheel for trainer: finished with status 'done' 
INFO 2017-01-10 17:22:12 +0000  master-replica-0     Stored in directory: /root/.cache/pip/wheels/e8/0c/c7/b77d64796dbbac82503870c4881d606fa27e63942e07c75f0e 
INFO 2017-01-10 17:22:12 +0000  master-replica-0    Successfully built trainer 
INFO 2017-01-10 17:22:13 +0000  master-replica-0    Running command: python -m trainer.task --output_path gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/training --eval_data_paths gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/eval* --train_data_paths gs://api-project-773889352370-ml/MeerkatReader/MeerkatReader_MeerkatReader_20170110_170701/preproc/train* 
INFO 2017-01-10 17:22:14 +0000  master-replica-0    Starting master/0 
INFO 2017-01-10 17:22:14 +0000  master-replica-0    Initialize GrpcChannelCache for job master -> {0 -> localhost:2222} 
INFO 2017-01-10 17:22:14 +0000  master-replica-0    Started server with target: grpc://localhost:2222 
ERROR 2017-01-10 17:22:16 +0000  master-replica-0    device_filters: "/job:ps" 
INFO 2017-01-10 17:22:19 +0000  master-replica-0    global_step/sec: 0 

The last line just repeats until I kill the job.

Is my mental model of this service incorrect? All suggestions welcome.

Answers


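Everything looks fine. My suspicion is that there is a problem with your data. Specifically, I suspect TF cannot read any data from your GCS files (are they empty?). If so, when you call train, TF ends up blocking while it tries to read a batch of data that it can never complete.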

I suggest adding logging statements around the call to session.run in Trainer.run_training. That will tell you whether this is the line that is stuck.
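As a rough illustration of that suggestion, here is a minimal sketch of logging immediately before and after a call that might block. The helper `run_with_logging` and the stand-in `fake_session_run` are hypothetical names invented for this example; in the real trainer you would pass the actual `session.run` call instead.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("trainer")


def run_with_logging(step_fn, *args, **kwargs):
    """Call step_fn, logging right before and right after the call.

    If the "calling" message shows up in the job log but the "returned"
    message never does, step_fn is the call that is blocking.
    """
    name = getattr(step_fn, "__name__", repr(step_fn))
    logger.info("calling %s", name)
    start = time.time()
    result = step_fn(*args, **kwargs)
    logger.info("%s returned after %.3f s", name, time.time() - start)
    return result


def fake_session_run(fetches):
    # Hypothetical stand-in for session.run, used only to make the sketch runnable.
    return [value * 2 for value in fetches]
```

For example, `run_with_logging(fake_session_run, [1, 2, 3])` logs both messages and returns the stand-in's result unchanged.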

I also suggest checking the size of your GCS files.
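One way to do that check is sketched below, assuming you have saved the text output of `gsutil ls -l` on the preproc directory and that each object line has the form `<size> <timestamp> <url>`. The function name, the 1024-byte threshold, and the bucket and file names in the sample are all made up for illustration.

```python
def find_small_objects(listing, min_bytes=1024):
    """Return (url, size) pairs for listed objects smaller than min_bytes.

    `listing` is assumed to be the text printed by `gsutil ls -l`, with
    object lines of the form "<size>  <timestamp>  <url>".
    """
    small = []
    for line in listing.splitlines():
        parts = line.split()
        # Skip blank lines, the trailing TOTAL summary, and anything unexpected.
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        size, _timestamp, url = parts
        if int(size) < min_bytes:
            small.append((url, int(size)))
    return small


# Hypothetical listing; bucket and file names are placeholders.
SAMPLE_LISTING = """\
        20  2017-01-10T17:10:00Z  gs://example-bucket/preproc/eval-00000.gz
     34512  2017-01-10T17:10:02Z  gs://example-bucket/preproc/train-00000.gz
TOTAL: 2 objects, 34532 bytes (33.72 KiB)
"""
```

A compressed file of only a few tens of bytes is a strong hint that preprocessing wrote no records into it.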

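TensorFlow also has an experimental RunOptions that lets you specify a timeout for Session.run. Once this feature is ready, it could be useful for making sure code does not block forever.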


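The GCS files are 'empty': they exist but are only 20 bytes, whereas each .gz in the flowers tutorial is about 20-50 KB. It is not clear to me what causes preprocess.py to fail (maybe I should open a new question with the right tags). – bw4sz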


Confirmed. For future reference, this is what happens if the paths in eval.csv are wrong. I had an extra slash in the bucket name. – bw4sz
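A mistake like this could be caught before preprocessing with a quick check of the CSV. The sketch below assumes each row's first column is a `gs://` image URI, as in the flowers tutorial's CSV files. The function name and sample paths are hypothetical, and the pattern is only a heuristic: it flags empty path segments such as a doubled slash, even though GCS object names can technically contain them.

```python
import csv
import io
import re

# gs://<bucket>/<object> with no empty path segments and no whitespace.
_GCS_URI = re.compile(r"^gs://[^/\s]+(/[^/\s]+)+$")


def find_malformed_uris(csv_text):
    """Return (row_number, uri) for rows whose first column looks malformed."""
    bad = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if not row:
            continue  # ignore blank lines
        uri = row[0].strip()
        if not _GCS_URI.match(uri):
            bad.append((row_number, uri))
    return bad


# Hypothetical eval.csv content; the second row has an extra slash.
SAMPLE_CSV = (
    "gs://example-bucket/images/0.jpg,0\n"
    "gs://example-bucket//images/1.jpg,1\n"
)
```

Running `find_malformed_uris` on the contents of eval.csv before submitting the preprocessing step would list the rows to fix.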