I am having an AWS EMR cluster (v5.11.1) with Spark(v2.2.1) and trying to use AWS Glue Data Catalog as its metastore. As per guidelines provided in official AWS documentation (reference link below), I have followed the steps but I am facing some discrepancy with regards to accessing the Glue Catalog DB/Tables. Both EMR Cluster & AWS Glue are in the same account and appropriate IAM permissions have been provided.
AWS Documentation : https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html
Observations:
- Using spark-shell (From EMR Master Node):
- Works. Able to access Glue DB/Tables using below commands:
spark.catalog.setCurrentDatabase("test_db") spark.catalog.listTables
- Using spark-submit (From EMR Step):
- Does not work. Keep getting the error "Database 'test_db' does not exist"
Error Trace is as below:
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO HiveMetaStore: 0: get_database: default
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
INFO HiveMetaStore: 0: get_database: global_temp
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: global_temp
WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/6d0f6b2c-cccd-4e90-a524-93dcc5301e20_resources
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created local directory: /mnt3/yarn/usercache/hadoop/appcache/application_1547055968446_0005/container_1547055968446_0005_01_000001/tmp/yarn/6d0f6b2c-cccd-4e90-a524-93dcc5301e20
INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/6d0f6b2c-cccd-4e90-a524-93dcc5301e20/_tmp_space.db
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs:///user/spark/warehouse
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
INFO CodeGenerator: Code generated in > 191.063411 ms
INFO CodeGenerator: Code generated in 10.27313 ms
INFO HiveMetaStore: 0: get_database: test_db
INFO audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: test_db
WARN ObjectStore: Failed to get database test_db, returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Database 'test_db' does not exist.; at org.apache.spark.sql.internal.CatalogImpl.requireDatabaseExists(CatalogImpl.scala:44) at org.apache.spark.sql.internal.CatalogImpl.setCurrentDatabase(CatalogImpl.scala:64) at org.griffin_test.GriffinTest.ingestGriffinRecords(GriffinTest.java:97) at org.griffin_test.GriffinTest.main(GriffinTest.java:65) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
After lot of research and going through many suggestions in blogs, I have tried the below fixes but of no avail and we are still facing the discrepancy.
Reference Blogs:
- https://forums.aws.amazon.com/thread.jspa?threadID=263860
- Spark Catalog w/ AWS Glue: database not found
- https://okera.zendesk.com/hc/en-us/articles/360005768434-How-can-we-configure-Spark-to-use-the-Hive-Metastore-for-metadata-
Fixes Tried:
- Enabling Hive support in spark-defaults.conf & SparkSession (Code):
Hive classes are on CLASSPATH and have set spark.sql.catalogImplementation internal configuration property to hive:
spark.sql.catalogImplementation hive
Adding Hive metastore config:
.config("hive.metastore.connect.retries", 15) .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
Code Snippet:
SparkSession spark = SparkSession.builder().appName("Test_Glue_Catalog")
.config("spark.sql.catalogImplementation", "hive")
.config("hive.metastore.connect.retries", 15)
.config("hive.metastore.client.factory.class","com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
.enableHiveSupport()
.getOrCreate();
Any suggestions in figuring out the root cause for this discrepancy would be really helpful.
Appreciate your help! Thank you!