Query results differ between EMR-Presto and Athena

Published 2019-07-26 17:22

I have connected the Glue catalog to Athena and to an EMR instance (with Presto installed). Running the same query on both gives different results: EMR returns 0 rows, while Athena returns 43 rows. The query is fairly simple, with a left join, a group by, and a count distinct. It looks like this:

select
  t1.customer_id as id,
  t2.purchase_date as purchase_date,
  count(distinct t1.purchase_id) as item_count
from 
  table1 t1
left join
  table2 as t2
  on t2.purchase_id=t1.purchase_id
where 
  t1.item_type='ABC' 
  and t1.purchase_status='CONFIRMED' 
  and t1.region_id in ('A','B','C')
  and t2.status='Dispatched'
  and t2.purchase_date between date_add('day',-50,date('2018-09-13')) and date('2018-09-13')
  and t1.created_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
  and t1.updated_at between date_add('day',-60,date('2018-09-13')) and date('2018-09-13')
group by
  t1.customer_id,t2.purchase_date;

I tried some other queries and their results match completely, so I am not sure what is wrong with this particular query.

EMR Version: 5.17.0
Presto Version: 0.206

EDIT: I realised that the issue is in the first table itself: Presto on EMR is not able to find any rows in table1 for some reason. I am not sure why this would happen, since both Presto-EMR and Athena use the same Glue catalog. I also tried Hive on the same EMR instance, and it is able to find rows in table1.

select * from table1 limit 10;

The above statement returns 10 rows with Hive but zero rows with Presto. In debug mode I see the following exception:

Query 20180917_075536_00023_4988g failed: com.facebook.presto.spi.type.TimestampType
java.lang.UnsupportedOperationException: com.facebook.presto.spi.type.TimestampType
    at com.facebook.presto.spi.type.AbstractType.writeSlice(AbstractType.java:135)
    at com.facebook.presto.hive.parquet.reader.ParquetBinaryColumnReader.readValue(ParquetBinaryColumnReader.java:55)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.lambda$readValues$1(ParquetPrimitiveColumnReader.java:184)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.processValues(ParquetPrimitiveColumnReader.java:204)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readValues(ParquetPrimitiveColumnReader.java:183)
    at com.facebook.presto.hive.parquet.reader.ParquetPrimitiveColumnReader.readPrimitive(ParquetPrimitiveColumnReader.java:171)
    at com.facebook.presto.hive.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:208)
    at com.facebook.presto.hive.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:258)
    at com.facebook.presto.hive.parquet.reader.ParquetReader.readBlock(ParquetReader.java:241)
    at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:244)
    at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:222)
    at com.facebook.presto.spi.block.LazyBlock.assureLoaded(LazyBlock.java:262)
    at com.facebook.presto.spi.block.LazyBlock.getLoadedBlock(LazyBlock.java:253)
    at com.facebook.presto.spi.Page.getLoadedPage(Page.java:247)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:245)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
    at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
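
The exception appears to point to a type mismatch while reading Parquet: a binary column reader is writing into a column that the table schema declares as a timestamp, which is what happens when columns are matched by position and the physical Parquet field order differs from the catalog schema. As a hedged diagnostic sketch (not part of the original post), the column definitions the Glue catalog reports can be compared from both engines; DESCRIBE is standard in both Presto and Hive:

-- diagnostic sketch: run in both presto-cli and the hive CLI and compare
-- identical output only confirms the catalog schema; a mismatch against the
-- physical Parquet field order can still exist (see the answer below)
describe table1;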

1 Answer
孤傲高冷的网名
2019-07-26 18:26

By default, Presto matches Parquet fields to the table schema by position. If the order of your fields varies (e.g. the data has been written with different column orders over time), you need to enable matching by name. You can do this in hive.properties:

hive.parquet.use-column-names = true

or on the session level:

set session hive.parquet_use_column_names = true;
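
With the session property enabled, re-running the probe statement from the question should return rows if positional matching was the culprit (a sketch; it assumes the Hive catalog is named hive, matching the property above):

set session hive.parquet_use_column_names = true;

-- the probe query from the question; it returned zero rows before the change
select * from table1 limit 10;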

Here is a related issue: https://github.com/prestodb/presto/issues/8911
