Spark EC2 SSH connection error SSH return code 255

Posted 2020-08-10 19:15

Question:

Every time I start a Spark cluster on AWS with the ec2/spark_ec2.py script, I get repeated SSH connection errors. They eventually resolve, but the wait wastes a lot of time.

Before you mark this as a duplicate: I'm aware that quite a few similar questions have been asked, but there are two key distinctions: a) my connection always completes (eventually) and I end up with a healthy Spark cluster, and b) the "answers" to the other questions generally center on earlier Spark versions (e.g., 1.2, 1.3). I have seen this issue consistently over the past 12 months, from 1.3 through 1.6.1 today.

Thanks in advance!

Terminal Output:

Launched master in us-east-1e, regid = r-a1b2c3d4
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state...........

Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused

.

Warning: SSH connection error. (This could be temporary.)
Host: ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
SSH return code: 255
SSH output: ssh: connect to host ec2-xx-xx-xx-xxx.compute-1.amazonaws.com port 22: Connection refused

.
Cluster is now in 'ssh-ready' state. Waited 833 seconds.
Generating cluster's SSH key on master...

Answer 1:

Please check whether your EC2 security group has the SSH port (22) open.
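As a rough illustration of that check, here is a minimal Python sketch. It assumes the inbound rules have already been fetched as a list of dicts in the shape boto3's describe_security_groups returns under "IpPermissions". The helper name allows_ssh and the sample rules are hypothetical, and the sketch ignores which source addresses each rule admits.

```python
def allows_ssh(ip_permissions, port=22):
    """Return True if any inbound rule covers TCP `port`.

    `ip_permissions` is assumed to look like the "IpPermissions" list
    from boto3's describe_security_groups. Source CIDRs are NOT checked.
    """
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            # "-1" means all protocols and all ports.
            return True
        if protocol != "tcp":
            continue
        from_port = rule.get("FromPort")
        to_port = rule.get("ToPort")
        if from_port is None or to_port is None:
            continue
        if from_port <= port <= to_port:
            return True
    return False


# Hypothetical sample rules, for illustration only.
web_only = [{"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8081}]
with_ssh = web_only + [{"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22}]

print(allows_ssh(web_only))
print(allows_ssh(with_ssh))
```

If this returns False for the group the cluster is launched into, port 22 is not covered by any inbound rule.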



Answer 2:

Check your security group's inbound rules: inbound SSH traffic must be enabled.
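One way to see from the client side whether inbound SSH is reachable is to probe port 22 directly. The sketch below is only an illustration using Python's standard socket module; probe_port is a hypothetical helper name. As a general rule of thumb, a refusal means the host answered but nothing was listening on the port yet, whereas a port filtered by a firewall or security group tends to time out.

```python
import socket


def probe_port(host, port=22, timeout=5.0):
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout", or "error" (e.g. DNS failure).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but no service is listening on the port.
        return "refused"
    except socket.timeout:
        # No reply at all within `timeout` seconds.
        return "timeout"
    except OSError:
        return "error"
```

Calling probe_port with the master's public hostname, as printed in the spark-ec2 output, shows which of these cases applies while the script is still waiting.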



Answer 3:

Please confirm that the key pair name on the client matches the one on the destination machine.

On the client, the key is probably stored in ~/.ssh as a .pem file. On the destination host, the key pair name is shown in the EC2 console (click the instance, then the Description tab).

Another way to check: start a new EC2 instance with the same key pair and log in using the corresponding .pem file.

Also check the security groups.



Answer 4:

The spark-ec2 scripts build AMIs based on the Amazon Linux base AMI:

# Creates an AMI for the Spark EC2 scripts starting with a stock Amazon 
# Linux AMI.
# This has only been tested with Amazon Linux AMI 2014.03.2 

I therefore believe that the delay in SSH connectivity (the slow startup) is due to the EC2 instance applying critical patches and security updates on creation, or attempting to and timing out, depending on VPC configuration. This is detailed in the Amazon Linux AMI FAQ:

On first boot, the Amazon Linux AMI installs from the package repositories any user space security updates that are rated critical or important, and it does so before services, such as SSH, start.

If the AMI cannot access the yum repositories, it will timeout and retry multiple times before completing the boot procedure. Possible reasons for this are restrictive firewall settings or VPC settings, which prevent access to the Amazon Linux AMI package repositories.

If this is indeed the case, then creating your own AMI from a VM that has all of the relevant updates applied and calling the script with the --ami option should resolve the problem (this can be automated to keep on top of everything).

One could potentially test this first by disabling the security update process, as per the FAQ:

To disable the security update on boot from the AWS EC2 Console:

On the "Advanced Instance Options" page in the Request Instances Wizard, there is a text field for sending the Amazon Linux AMI user-data. This data can be entered as text, or uploaded as a file. In either case, the data should be:

#cloud-config
repo_upgrade: none

To disable the security update on boot from the command line:

Create a text file with the preceding user-data, and pass it to aws ec2 run-instances with the --user-data file://<filename> flag (this can also be done with ec2-run-instances -f).
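The command-line variant can be sketched in Python as follows. This is a minimal illustration, not the FAQ's own procedure: the helper names, the file name user-data.txt and the placeholder AMI ID "ami-EXAMPLE" are assumptions, and the command is only assembled and printed, not executed.

```python
import os
import tempfile

# The cloud-config payload quoted in the FAQ excerpt above.
USER_DATA = "#cloud-config\nrepo_upgrade: none\n"


def write_user_data(directory):
    """Write the user-data payload to a file and return its path."""
    path = os.path.join(directory, "user-data.txt")  # hypothetical file name
    with open(path, "w") as handle:
        handle.write(USER_DATA)
    return path


def run_instances_command(ami_id, user_data_path):
    """Build (but do not run) an `aws ec2 run-instances` argument list."""
    return [
        "aws", "ec2", "run-instances",
        "--image-id", ami_id,
        "--user-data", "file://" + user_data_path,
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path = write_user_data(tmp)
        # "ami-EXAMPLE" is a placeholder, not a real AMI ID.
        print(" ".join(run_instances_command("ami-EXAMPLE", data_path)))
```

A real invocation would also need the instance type, key pair, and security group options for the account in question.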

To disable the security update on boot when rebundling the Amazon Linux AMI:

Modify /etc/cloud/cloud.cfg and change repo_upgrade: security to repo_upgrade: none.
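For the rebundling case, the edit to cloud.cfg could be scripted roughly like this. It is a sketch that assumes the setting appears on its own line as "repo_upgrade: <value>"; disable_repo_upgrade is a hypothetical helper that works on the file's text, so reading and writing /etc/cloud/cloud.cfg (which needs root) is left to the caller.

```python
import re

# Matches a line such as "repo_upgrade: security", keeping any indentation.
_REPO_UPGRADE = re.compile(r"^([ \t]*repo_upgrade:[ \t]*)\S+", re.MULTILINE)


def disable_repo_upgrade(cfg_text):
    """Return `cfg_text` with the repo_upgrade value set to `none`."""
    return _REPO_UPGRADE.sub(r"\g<1>none", cfg_text)


# Hypothetical excerpt of a cloud.cfg file, for illustration only.
sample = "locale: en_US.UTF-8\nrepo_upgrade: security\nssh_pwauth: false\n"
print(disable_repo_upgrade(sample))
```

Text that contains no repo_upgrade line is returned unchanged.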