Distributed TensorFlow in Kubeflow - NotFoundError

I followed the tutorial for setting up Kubeflow on GCP.

At the last step, I deployed the code and started training on CPU:

kustomize build . | kubectl apply -f -

The distributed TensorFlow job then fails with this error:

tensorflow.python.framework.errors_impl.NotFoundError: /tmp/tmprIn1Il/model.ckpt-1_temp_a890dac1971040119aba4921dd5f631a; No such file or directory
[[Node: save/SaveV2 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:ps/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv_layer1/conv2d/bias, conv_layer1/conv2d/kernel, conv_layer2/conv2d/bias, conv_layer2/conv2d/kernel, dense/bias, dense/kernel, dense_1/bias, dense_1/kernel, global_step)]]

I found a similar bug report, but I don't know how to resolve the problem.

Tags: tensorflow kubeflow

1 Answer

Lonely孤独者°

Reply #2 · 2019-08-30 09:29

From the bug report:

You can work around this problem by using a shared filesystem (e.g. HDFS, GCS, or an NFS mount at the same mount point) on the workers and the parameter servers.

Just put the data on GCS and it works fine.

model.py

import tensorflow_datasets as tfds
import tensorflow as tf

# tfds works in both Eager and Graph modes
tf.enable_eager_execution()

# See available datasets
print(tfds.list_builders())

ds_train, ds_test = tfds.load(name="mnist", split=["train", "test"], data_dir="gs://kubeflow-tf-bucket", batch_size=-1)
ds_train = tfds.as_numpy(ds_train)
ds_test = tfds.as_numpy(ds_test)

(x_train, y_train) = ds_train['image'], ds_train['label']
(x_test, y_test) = ds_test['image'], ds_test['label']
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))
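The workaround quoted from the bug report concerns where checkpoints are written: the path in the error, /tmp/tmprIn1Il/..., is local to a single pod, so the other tasks cannot see it. As a rough sketch (not part of the tutorial), a small guard like the one below could reject such a directory before training starts. The helper names, the set of schemes, and the `checkpoints` subdirectory are all assumptions made for illustration; only the bucket name comes from the code above.

```python
from urllib.parse import urlparse

# Assumption: URI schemes that every worker and parameter server can reach.
# An NFS mount looks like a plain local path, so it must be declared explicitly.
REMOTE_SCHEMES = {"gs", "hdfs"}


def is_shared_checkpoint_dir(path, shared_mounts=()):
    """Return True if `path` looks reachable from all tasks in the cluster."""
    if urlparse(path).scheme in REMOTE_SCHEMES:
        return True
    # A local path only counts when it sits under a mount declared as shared.
    for mount in shared_mounts:
        root = mount.rstrip("/")
        if path == root or path.startswith(root + "/"):
            return True
    return False


def require_shared_checkpoint_dir(path, shared_mounts=()):
    """Fail fast instead of hitting NotFoundError at the first checkpoint save."""
    if not is_shared_checkpoint_dir(path, shared_mounts):
        raise ValueError(
            "checkpoint dir %r is local to one pod; "
            "use gs://, hdfs://, or a shared mount" % path
        )
    return path


# Hypothetical subdirectory inside the bucket used above.
model_dir = require_shared_checkpoint_dir("gs://kubeflow-tf-bucket/checkpoints")
```

The validated `model_dir` would then be passed wherever the training code sets its checkpoint directory, for example the `model_dir` argument of an Estimator, rather than leaving it unset and falling back to a temporary directory.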
