Using Redshift Spectrum to read the data in an external table

Published 2019-08-18 12:22

Question:

I did the following in an AWS Redshift cluster to read a Parquet file from S3.

create external schema s3_external_schema 
from data catalog 
database 'dev' 
iam_role 'arn:aws:iam::<MyuniqueId>:role/<MyUniqueRole>'
create external database if not exists;

then

CREATE external table s3_external_schema.SUPPLIER_PARQ_1 (
S_SuppKey BIGINT ,
S_Name varchar(64) ,
S_Address varchar(64) ,
S_NationKey int ,
S_Phone varchar(18) ,
S_AcctBal decimal(13, 2) ,
S_Comment varchar(105))
partitioned by (Supplier bigint, nation int)
stored as PARQUET
location 's3://<My Bucket>/<File partition>/';

Both of the above create statements were successful. However, when I ran the query below, it returned nothing.

select * from s3_external_schema."supplier_parq_1"
limit 10;

No rows returned.

Am I missing any grants or access permissions needed to make this query return output?

Answer 1:

Steps to debug a non-working Redshift Spectrum query:

  1. Try the same query using Athena: the easiest way is to run a Glue crawler against the S3 folder. It should create a Hive metastore table that you can query straight away in Athena, using the same SQL you already have.
  2. Use S3 Select: in the AWS console, navigate to one of your Parquet files, right-click and choose "Select", then pick Parquet as the format (it may already be the default) and see what you get (see the sketch after this list).
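
If S3 Select is unfamiliar: it simply runs a small SQL expression against a single object, where S3Object is the fixed alias S3 Select gives the file being queried. A minimal sketch of the kind of expression it accepts (the LIMIT value here is only an assumption):

-- S3 Select expression run against one Parquet object
SELECT * FROM S3Object s LIMIT 10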

If 1 or 2 works, then it is likely that your table definition in Spectrum is not right. Instead, just use the definition as created in the Hive metastore, like this:

create external schema some_schema from data catalog
database 'the_name_you_gave_the_hive_db'
iam_role 'whatever'
create external database if not exists;

You can then just use the newly defined Redshift Spectrum schema without any further table definition.
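
For example, assuming the Glue crawler created a table named supplier_parq_1 inside that database (the actual table name depends on what the crawler generated from your S3 folder structure), you could query it directly:

-- hypothetical: schema name follows the example above, table name comes from the crawler
select * from some_schema."supplier_parq_1"
limit 10;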



Answer 2:

In addition to what Jon Scott said, if an external table is created with partitions, no data will be shown unless you run ALTER TABLE to add those partitions.

What needs to be run after the table is created, and again whenever new partitions are added, is:

ALTER TABLE s3_external_schema.SUPPLIER_PARQ_1
ADD PARTITION (supplier=..., nation=...)
LOCATION 's3://<My Bucket>/<File partition>/';
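
As a concrete sketch, assuming the files are laid out in Hive-style folders such as supplier=1/nation=1/ under the table location (the partition values and folder layout here are assumptions, not taken from the question), adding and then verifying one partition would look like this:

-- hypothetical partition values and S3 folder layout
ALTER TABLE s3_external_schema.SUPPLIER_PARQ_1
ADD PARTITION (supplier=1, nation=1)
LOCATION 's3://<My Bucket>/<File partition>/supplier=1/nation=1/';

-- confirm the partition was registered
select * from svv_external_partitions
where tablename = 'supplier_parq_1';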

To check the error messages for external table queries, use the SVL_S3LOG system table.
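
For example, a minimal sketch of how you might look at those messages right after a failing query, using PG_LAST_QUERY_ID() to pick up the last query run in the current session:

-- inspect Spectrum errors for the query that just ran in this session
select query, segment, slice, message
from svl_s3log
where query = pg_last_query_id();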