How to run a Glue script from a Glue Dev Endpoint

Published 2019-02-15 19:02

Question:

I have a Glue script (test.py) written, say, in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (or I could store it in an S3 bucket). A Glue dev endpoint is essentially an EMR cluster, so how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?

I know we can run it from the Glue console, but I'm more interested in whether I can run it from the Glue dev endpoint terminal.

Answer 1:

You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).

e.g.

radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py

You can also use ssh to run the script directly, without getting an interactive shell (of course, after uploading the script with scp or whatever):

radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py

If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
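For reference, here is a minimal sketch of what the preamble of such an auto-generated script typically looks like, showing why those parameters are needed; the job name and TempDir path in the comments are placeholders, not values from the question.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# getResolvedOptions() reads named arguments from sys.argv, so the script
# expects them on the command line, e.g.:
#   gluepython myscript.py --JOB_NAME my_test_job --TempDir s3://my-bucket/glue-temp/
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'TempDir'])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# The Job class uses JOB_NAME to track the run (e.g. for bookmarks)
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# ... your transformations go here ...

job.commit()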



Answer 2:

For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue dev endpoint URL, so that you have access to the Data Catalog, crawlers, etc., as well as the S3 bucket where your data resides.

After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a Job pointing to the ETL script in the S3 bucket, so that the job can be run, and scheduled as well.
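As a rough sketch, the job creation step can be scripted with boto3; the role name, bucket, and job name below are placeholders under the assumption that the script has already been uploaded to S3.

import boto3

glue = boto3.client('glue')

# Register a Spark ETL job that points at the uploaded script (placeholder names/paths)
glue.create_job(
    Name='test-etl-job',
    Role='MyGlueServiceRole',  # IAM role with Glue and S3 permissions
    Command={
        'Name': 'glueetl',  # Spark ETL job type
        'ScriptLocation': 's3://my-bucket/scripts/test.py',
    },
    DefaultArguments={
        '--TempDir': 's3://my-bucket/glue-temp/',
    },
)

# The job can then be started on demand, or attached to a trigger/schedule
glue.start_job_run(JobName='test-etl-job')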

Please refer here and to setting up Zeppelin on Windows for help with setting up the local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).

Once you have set up the Zeppelin notebook, you can copy the script (test.py) into the notebook and run it from Zeppelin.

According to the AWS Glue FAQ:

Q: When should I use AWS Glue vs. Amazon EMR?

AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.

Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives you more flexibility: you can use any third-party Python libraries and run the script directly on an EMR Spark cluster.

Regards