Why can't Impala read Parquet files after Spark writes them?

Posted 2019-02-20 12:37

Question:

I'm having some issues with the way Spark converts column types when writing Parquet.

I have an Oracle source whose schema I have confirmed (via df.printSchema()):

root
  |-- LM_PERSON_ID: decimal(15,0) (nullable = true)
  |-- LM_BIRTHDATE: timestamp (nullable = true)
  |-- LM_COMM_METHOD: string (nullable = true)
  |-- LM_SOURCE_IND: string (nullable = true)
  |-- DATASET_ID: decimal(38,0) (nullable = true)
  |-- RECORD_ID: decimal(38,0) (nullable = true)
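
For context, a minimal Scala sketch of how such a DataFrame might be loaded and its schema confirmed; the JDBC URL, table name, and credentials below are placeholders, not details from the original setup (the Oracle JDBC driver must be on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-to-parquet").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL") // placeholder URL
  .option("dbtable", "LM_PERSON")                         // placeholder table
  .option("user", "scott")                                // placeholder credentials
  .option("password", "tiger")
  .load()

df.printSchema() // prints the root |-- ... tree shown above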

This is then saved as Parquet (via df.write().parquet()), with the corresponding message type determined by Spark:

message spark_schema {
  optional int64 LM_PERSON_ID (DECIMAL(15,0));
  optional int96 LM_BIRTHDATE;
  optional binary LM_COMM_METHOD (UTF8);
  optional binary LM_SOURCE_IND (UTF8);
  optional fixed_len_byte_array(16) DATASET_ID (DECIMAL(38,0));
  optional fixed_len_byte_array(16) RECORD_ID (DECIMAL(38,0));
}

My application then generates the table DDL using a HashMap for type conversion, for example:

CREATE EXTERNAL TABLE IF NOT EXISTS 
ELM_PS_LM_PERSON (
LM_PERSON_ID DECIMAL(15,0)
,LM_BIRTHDATE TIMESTAMP
,LM_COMM_METHOD STRING
,LM_SOURCE_IND STRING
,DATASET_ID DECIMAL(38,0)
,RECORD_ID DECIMAL(38,0)
) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'
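
The question does not show the application's HashMap, so purely as an illustration, a hypothetical Scala sketch of such a DDL generation step (all names here are mine):

import org.apache.spark.sql.types._

// Hypothetical mapping from Spark types to Impala column types.
def toImpalaType(dt: DataType): String = dt match {
  case d: DecimalType => s"DECIMAL(${d.precision},${d.scale})"
  case TimestampType  => "TIMESTAMP"
  case StringType     => "STRING"
  case LongType       => "BIGINT"
  case other          => sys.error(s"Unmapped type: $other")
}

def generateDdl(table: String, schema: StructType, location: String): String = {
  val cols = schema.fields
    .map(f => s"${f.name} ${toImpalaType(f.dataType)}")
    .mkString("\n,")
  s"""CREATE EXTERNAL TABLE IF NOT EXISTS $table (
     |$cols
     |) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '$location'""".stripMargin
}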

My issue is that Impala fails to read the table because it will not accept LM_PERSON_ID as a decimal field. The table only reads the Parquet file if this column is set to BIGINT.

Query 8d437faf6323f0bb:b7ba295d028c8fbe: 0% Complete (0 out of 1)
File 'hdfs:dev/ELM/ELM_PS_LM_PERSON/part-00000-fcdbd3a5-9c93-490e-a124-c2a327a17a17.snappy.parquet' has an incompatible Parquet schema for column 'rbdshid1.elm_ps_lm_person_2.lm_person_id'. 
Column type: DOUBLE, Parquet schema:
optional int64 LM_PERSON_ID [i:0 d:1 r:0]

How do I know when to substitute BIGINT for a decimal field?

The Parquet message type is logged, but is it not accessible programmatically?
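
(For what it's worth, the footer schema is accessible programmatically. A minimal Scala sketch using parquet-mr's ParquetFileReader, assuming the parquet-hadoop artifact is on the classpath; the file path is a placeholder:)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path; point it at one of the written part files.
val file = HadoopInputFile.from(
  new Path("hdfs:///dev/ELM/ELM_PS_LM_PERSON/part-00000.snappy.parquet"),
  new Configuration())

val reader = ParquetFileReader.open(file)
try {
  // Prints the same "message spark_schema { ... }" block shown above.
  println(reader.getFooter.getFileMetaData.getSchema)
} finally {
  reader.close()
}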

Two decimal fields are converted to fixed_len_byte_array(16), while LM_PERSON_ID is converted to int64.

The only resolution I can think of is to create the table, test whether it can be read, and if not, drop it and substitute the decimal fields with BIGINT one by one, testing each time.

What am I missing here? Can I enforce a schema for decimals in the Parquet file?

Answer 1:

The very similar SPARK-20297 (Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala) was resolved fairly recently (20/Apr/17 01:59) as Not A Problem.

The main point is to enable the spark.sql.parquet.writeLegacyFormat property so that Spark writes the Parquet metadata in the legacy format (which I don't see described in the official documentation under Configuration; documenting it was reported as an improvement in SPARK-20937).

Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.
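
A minimal Scala sketch of the fix: enable the legacy format before writing, so that every decimal column is written as fixed_len_byte_array, which older Hive and Impala readers accept. The output path is a placeholder, and partitionBy assumes the DataFrame carries an edi_business_day column matching the DDL above:

// Must be set before the write is executed.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

df.write
  .partitionBy("edi_business_day")
  .parquet("hdfs:///dev/ELM/ELM_PS_LM_PERSON") // placeholder path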

It does follow the newer standard (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal), and I missed the documentation. Wouldn't that then be a bug in Impala or Hive?

The int32/int64 options were present in the original version of the decimal spec; they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289. So it's not a new/old version thing; it was just an alternative representation that many systems didn't implement.
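
In other words, the representation is chosen purely by precision: up to 9 digits fits int32, up to 18 fits int64, and anything wider falls back to fixed_len_byte_array. That is why DECIMAL(15,0) became int64 while the DECIMAL(38,0) columns became fixed_len_byte_array(16). A minimal Scala sketch of that rule; the byte-count formula is my assumption, modeled on the minimum bytes needed to hold 10^precision - 1 as a signed value:

def physicalTypeForDecimal(precision: Int): String =
  if (precision <= 9) "int32"
  else if (precision <= 18) "int64"
  else {
    // smallest byte count whose signed range covers 10^precision - 1
    val bytes = math.ceil((math.log(math.pow(10, precision)) / math.log(2) + 1) / 8).toInt
    s"fixed_len_byte_array($bytes)"
  }

println(physicalTypeForDecimal(15)) // int64                    -> LM_PERSON_ID
println(physicalTypeForDecimal(38)) // fixed_len_byte_array(16) -> DATASET_ID, RECORD_ID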

SPARK-10400 can also be quite helpful reading (about the history of the spark.sql.parquet.writeLegacyFormat property):

We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).