Every time I try to start a Spark cluster on AWS via the Spark ec2/spark_ec2.py script, I get an SSH connection error that eventually resolves itself but wastes a lot of time.
Before you mark this as a duplicate: I'm aware there are quite a few similar questions, but there are two key distinctions: a) my connection always completes (eventually) and I end up with a healthy Spark cluster, and b) the "answers" to those questions generally center on earlier Spark versions (e.g., 1.2, 1.3, etc.), whereas I have seen this issue for the last 12 months, from 1.3 back then through 1.6.1 today.
Thanks in advance!
Terminal Output:
Launched master in us-east-1e, regid = r-a1b2c3d4
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state...........
Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused
.
Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused
.
Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused
.
Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused
.
Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused
.
Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused
.
Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused
.
Cluster is now in 'ssh-ready' state. Waited 833 seconds.
Generating cluster's SSH key on master...
Please confirm that the key pair name on the client and on the destination machine match.
On the client it is probably stored as a .pem file in ~/.ssh; on the destination host it can be seen in the EC2 console (click the instance, then the Description tab).
Another way to check: start a new EC2 instance with the same key pair and log in using the corresponding .pem file.
Also mind the security groups.
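For completeness, here is a minimal sketch of how that check could be done programmatically; it is not part of spark-ec2 and assumes boto3 with configured credentials, with the region and instance ID below as placeholders:

```python
# Hedged sketch: print the key pair name recorded on the master instance so
# it can be compared with the .pem file passed to spark-ec2 locally.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")
instance = ec2.Instance("i-0123456789abcdef0")  # placeholder: the master's instance ID

print("Key pair on instance:", instance.key_name)
print("Public DNS:", instance.public_dns_name)
```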
Check this; you must enable inbound SSH traffic.
Please check that your security group in EC2 has the SSH port (22) open.
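As an illustration (a sketch only, with the security group name and CIDR as placeholder assumptions, and boto3 as the tooling), the inbound rules can be inspected and port 22 opened like this:

```python
# Hedged sketch: verify that the master's security group allows inbound SSH,
# and add the rule if it is missing. Group name and CIDR are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_security_groups(
    Filters=[{"Name": "group-name", "Values": ["my-spark-cluster-master"]}]
)
group = resp["SecurityGroups"][0]

has_ssh = any(
    rule.get("FromPort") == 22 and rule.get("ToPort") == 22
    for rule in group["IpPermissions"]
)
if not has_ssh:
    ec2.authorize_security_group_ingress(
        GroupId=group["GroupId"],
        IpProtocol="tcp",
        FromPort=22,
        ToPort=22,
        CidrIp="203.0.113.10/32",  # replace with your own IP
    )
```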
The spark-ec2 scripts build AMIs based on the Amazon Linux base AMI:
I therefore believe that the delay in SSH connectivity / slow start-up is due to the EC2 instance applying (or attempting to apply and timing out, depending on VPC configuration) critical patches / security updates on creation, as detailed in the Amazon Linux AMI FAQ:
If this is indeed the case, then creating your own AMI from a VM that already has all of the relevant updates applied, and calling the script with the --ami option, should resolve the problem (building that AMI can be automated so it stays current with updates).
One could potentially test this first by disabling the security update process, as per the FAQ:
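Below is a rough sketch of such a test (not from the FAQ itself): launch a plain instance from the same AMI with cloud-init told to skip the update run via repo_upgrade: none, then see whether SSH comes up faster than the ~14 minutes observed above. The AMI ID, instance type, key name and security group are placeholders, and boto3 is an assumption:

```python
# Hedged sketch: launch a test instance with security updates disabled at
# boot (repo_upgrade: none, per the Amazon Linux AMI FAQ) to compare how
# quickly port 22 becomes reachable. All identifiers below are placeholders.
import boto3

user_data = """#cloud-config
repo_upgrade: none
"""

ec2 = boto3.resource("ec2", region_name="us-east-1")
instances = ec2.create_instances(
    ImageId="ami-xxxxxxxx",                      # the spark-ec2 AMI for your region
    InstanceType="m3.large",
    KeyName="my-keypair",                        # hypothetical key pair name
    SecurityGroups=["my-spark-cluster-master"],  # hypothetical group name
    UserData=user_data,
    MinCount=1,
    MaxCount=1,
)
print("Launched test instance:", instances[0].id)
```

If an instance launched this way accepts SSH noticeably sooner than the wait seen in the log above, that would support the security-update explanation.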