I am new to distributed tensorflow. I found this distributed mnist test in here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py
But I don't know how to make it run. I used the following script:
python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=0 --worker_grpc_url="grpc://tf-worker0:2222"\
& python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=1 --worker_grpc_url="grpc://tf-worker1:2222"\
& python distributed_mnist.py --num_workers=3 --num_parameter_servers=1 --worker_index=2 --worker_grpc_url="grpc://tf-worker2:2222"
I just found these parameters are missing, so I pass them to the program. Here is what happened:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/train-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-images-idx3-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Extracting /tmp/mnist-data/t10k-labels-idx1-ubyte.gz
Worker GRPC URL: grpc://tf-worker0:2222
Worker index = 0
Number of workers = 3
Worker GRPC URL: grpc://tf-worker2:2222
Worker index = 2
Number of workers = 3
Worker GRPC URL: grpc://tf-worker1:2222
Worker index = 1
Number of workers = 3
Worker 0: Initializing session...
Worker 2: Waiting for session to be initialized...
Worker 1: Waiting for session to be initialized...
E0608 20:37:13.514249023 7501 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.514287961 7501 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:13.548052986 7502 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.548091527 7502 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:13.555449386 7503 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:13.555473898 7503 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
^CE0608 20:37:28.517451603 7504 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.517491102 7504 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:28.551002331 7505 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.551029795 7505 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
E0608 20:37:28.556681378 7506 resolve_address_posix.c:126] getaddrinfo: Name or service not known
D0608 20:37:28.556709728 7506 dns_resolver.c:189] dns resolution failed: retrying in 15 seconds
Anyone know how to run it correctly? Thanks a lot!