Is it possible to submit a Spark job to a YARN cluster and choose, either on the command line or inside the jar, which user will "own" the job?
The spark-submit will be launched from a script that contains the user name.
PS: is this still possible if the cluster has a Kerberos configuration (and the script has a keytab)?
For a non-kerberized cluster you can add a Spark conf as:
Another (much safer) approach is to use proxy authentication: basically, you create a service account and then allow it to impersonate other users. This assumes a Kerberized / secured cluster.
I say it's much safer because you don't need to store (and manage) keytabs for all the users you will have to impersonate.
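Assuming the service account is already allowed to impersonate the target user, the submission itself can be sketched with spark-submit's --proxy-user option. The wrapper name submit_as, the keytab path, the realm, the class and the jar below are hypothetical placeholders, not something from this thread:

```shell
# Allow the binary to be overridden (handy for dry runs: SPARK_SUBMIT=echo).
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"

# submit_as USER [spark-submit args...]
# Submits to YARN as the currently authenticated (service) account,
# asking Spark to impersonate USER via --proxy-user.
submit_as() {
  target_user="$1"
  shift
  "$SPARK_SUBMIT" --master yarn --deploy-mode cluster \
    --proxy-user "$target_user" "$@"
}

# Typical use (hypothetical keytab path, realm, class and jar):
#   kinit -kt /etc/security/keytabs/svc_spark_prd.keytab svc_spark_prd@EXAMPLE.COM
#   submit_as zorro --class com.example.MyApp /path/to/my-app.jar
```

Only the service account's keytab is involved here; the impersonated user needs no credentials on the submitting host.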
To enable impersonation, there are several settings you'd need to enable on the Hadoop side to tell which account(s) can impersonate which users or groups, and on which servers. Let's say you have created the svc_spark_prd service account/user.
hadoop.proxyuser.svc_spark_prd.hosts - list of fully-qualified domain names for servers which are allowed to submit impersonated Spark applications. * is allowed (any host) but not recommended.
Also specify either hadoop.proxyuser.svc_spark_prd.users or hadoop.proxyuser.svc_spark_prd.groups to list the users or groups that svc_spark_prd is allowed to impersonate. * is allowed (any user/group) but not recommended.
Also, check out the documentation on proxy authentication.
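As an illustration, these properties usually live in core-site.xml on the cluster side. The fragment below is a sketch only: the host names and the group name are hypothetical placeholders, and whether you list users or groups depends on your setup.

```xml
<!-- core-site.xml (fragment); host and group values are hypothetical -->
<property>
  <name>hadoop.proxyuser.svc_spark_prd.hosts</name>
  <!-- hosts allowed to submit impersonated applications -->
  <value>edge01.example.com,edge02.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.svc_spark_prd.groups</name>
  <!-- groups whose members svc_spark_prd may impersonate -->
  <value>spark_users</value>
</property>
```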
Apache Livy for example uses this approach to submit Spark jobs on behalf of other end users.
If your user exists, you can still launch your spark-submit with su $my_user -c "spark-submit [...]"
I am not sure about the Kerberos keytab, but if you run kinit as this user it should be fine.
If you can't use su because you don't want to deal with the password, I invite you to see this Stack Overflow answer: how to run script as another user without password
For a non-kerberized cluster: export HADOOP_USER_NAME=zorro before submitting the Spark job will do the trick.
Make sure to unset HADOOP_USER_NAME afterwards, if you want to revert to your default credentials in the rest of the shell script (or in your interactive shell session).
For a kerberized cluster, the clean way to impersonate another account without trashing your other jobs/sessions (that probably depend on your default ticket) would be something in this line...
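One way to keep either change from leaking into the rest of the script is to scope it to the single spark-submit call. The functions below are a minimal sketch under assumptions, not the snippet this answer originally pointed to: the function names are made up, and the Kerberos variant assumes a keytab for the target account plus a private ticket cache selected through KRB5CCNAME, so the default ticket is left alone.

```shell
# Binaries can be overridden, e.g. for a dry run.
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"
KINIT="${KINIT:-kinit}"
KDESTROY="${KDESTROY:-kdestroy}"

# Non-kerberized cluster: HADOOP_USER_NAME is set for this one command
# only, so there is nothing to unset afterwards.
submit_as_simple() {
  hadoop_user="$1"
  shift
  HADOOP_USER_NAME="$hadoop_user" "$SPARK_SUBMIT" "$@"
}

# Kerberized cluster: obtain a ticket in a private credential cache.
# The subshell keeps KRB5CCNAME out of the caller's environment.
# submit_with_private_ticket KEYTAB PRINCIPAL [spark-submit args...]
submit_with_private_ticket() {
  keytab="$1"
  principal="$2"
  shift 2
  (
    cache_dir="$(mktemp -d)" || exit 1
    export KRB5CCNAME="FILE:$cache_dir/krb5cc"
    "$KINIT" -kt "$keytab" "$principal" || { rm -rf "$cache_dir"; exit 1; }
    "$SPARK_SUBMIT" "$@"
    status=$?
    "$KDESTROY"            # drop the temporary ticket
    rm -rf "$cache_dir"    # and the directory that held it
    exit "$status"
  )
}
```

For example, submit_as_simple zorro --master yarn /path/to/my-app.jar, or submit_with_private_ticket /path/to/zorro.keytab zorro@EXAMPLE.COM --master yarn /path/to/my-app.jar (paths and realm are placeholders).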