I have connected Glue catalog to Athena and an EMR instance (with presto installed). I tried running the same query on both but am getting different results. EMR is giving 0 rows but Athena is giving 43 rows. The query is pretty simple with a left join
, group by
and a count distinct
. The query looks like this:
select
t1.customer_id as id,
t2.purchase_date as purchase_date,
count(distinct t1.purchase_id) as item_count
from
table1 t1
left join
table2 as t2
on t2.purchase_id=t1.purchase_id
where
t1.item_type='ABC'
and t1.purchase_status='CONFIRMED'
and t1.region_id in ('A','B','C')
and t2.status='Dispatched'
and t2.purchase_date between date_add('day',-50,date('2018-09-13')) and date('2018-09-13')
and t1.created_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
and t1.updated_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
group by
t1.customer_id,t2.purchase_date;
I tried some other queries but the results completely match. Not sure what is wrong with this query.
EMR Version: 5.17.0
Presto Version: 0.206
EDIT: I realised that the issue is in the first table itself. Presto-EMR is not able to find any rows in table1
for some reason. Not sure why this would happen since both Presto-EMR and Athena are using the same Glue catalog. I also tried Hive in the same EMR instance and it is able to find rows in table1
.
select * from table1 limit 10;
The above statement gives 10 rows with hive-sql but zero rows with presto-sql. I see the following exception in debug mode:
Query 20180917_075536_00023_4988g failed: com.facebook.presto.spi.type.TimestampType
java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.TimestampType
at com.facebook.presto.spi.type.AbstractType.writeSlice(AbstractType.java:135)
at com.facebook.presto.hive.parquet.reader.ParquetBinaryColumnReader.readValue(ParquetBinaryColumnReader.java:55)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.lambda$readValues$1(ParquetPrimitiveColumnReader.java:184)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.processValues(ParquetPrimitiveColumnReader.java:204)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readValues(ParquetPrimitiveColumnReader.java:183)
at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readPrimitive(ParquetPrimitiveColumnReader.java:171)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:208)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:258)
at com.facebook.presto.hive.parquet.reader.ParquetReader.readBlock(ParquetReader.java:241)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:244)
at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:222)
at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:262)
at com.facebook.presto.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:253)
at com.facebook.presto.spi.Page.getLoadedPage(Page.java:247)
at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:245)
at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Presto by default matches fields in Parquet with table schema based on position. If order of your fields varies (eg. has been written differently over the time), you need to enable matching by name. You can do this with
hive.properties
:or on the session level:
set session hive.parquet_use_column_names = true;
Here is a related issue: https://github.com/prestodb/presto/issues/8911