I have a glue script (test.py) written say in a editor. I connected to glue dev endpoint and copied the script to endpoint or I can store in S3 bucket. Basically glue endpoint is an EMR cluster, now how can I run the script from the dev endpoint terminal? Can I use spark-submit and run it ?
I know we can run it from glue console,but more interested to know if I can run it from glue end point terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the
gluepython
interpreter (not plainpython
).e.g.
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with
scp
or whatever):If this is a script that uses the
Job
class (as the auto-generated Python scripts do), you may need to pass--JOB_NAME
and--TempDir
parameters.For development / testing purpose, you can setup a zeppelin notebook locally, have an SSH connection established using the AWS Glue endpoint URL, so you can have access to the data catalog/crawlers,etc. and also the s3 bucket where your data resides.
After all the testing is completed, you can bundle your code, upload to an S3 bucket. Then create a Job pointing to the ETL script in S3 bucket, so that the job can be run, and scheduled as well.
Please refer here and setting up zeppelin on windows, for any help on setting up local environment. You can use dev instance provided by Glue, but you may incur additional costs for the same(EC2 instance charges).
Once you set up the zeppelin notebook, you can copy the script(test.py) to the zeppelin notebook, and run from the zeppelin.
According to AWS Glue FAQ:
Do you have any specific requirement to run Glue script in an EMR instance? Since in my opinion, EMR gives more flexibility and you can use any 3rd party python libraries and run directly in a EMR Spark cluster.
Regards