Hive table data load with default SerDE

2019-09-16 15:36发布

问题:

The data is in the following format

a,"b,c",d, e

p,q,"e,r", t

a,s,"t,g", t

I wanted to create a Hive table

Col1, Col2, Col3 , Col4

a , b,c , d , e

p , q , e,r , t

a , s , t,g , t

As you see above, if the data is encapsulated in double quotes, the comma in between should not be considered in creating the table data. If I use the default SerDe, the double quotes are ignored and b,c is considered as two separate columns.

How do I ensure that the double quote will ignore the comma in between two elements if encapsulated in double quotes

回答1:

If it is possible and feasible, I'd first recommend you explore seeing if the input data can be sanitized in such a way to use something besides commas for field delimiters. It is always dicey to use a delimiting pattern that may naturally occur in your data.

But if that is not possible, then there is this regular expression-based way to detect quoted commas:

  1. First ingest your data into a staging table as single-column rows (the entire line into each row).
  2. Detect commas occurring between quotation marks and replace them with an artificial place holder.
  3. Split the resulting string using commas as delimiters.
  4. Replace the artificial placeholders with the commas that they originally represented.

As a contrived concrete example, I loaded up the following single-column staging table with your data (step #1):

hive> DESCRIBE staging;
OK
rawline                     string
Time taken: 0.238 seconds, Fetched: 1 row(s)
hive> SELECT * FROM staging;
OK
a,"b,c",d, e
p,q,"e,r", t
a,s,"t,g", t
Time taken: 0.277 seconds, Fetched: 3 row(s)

The following query then generates the final target table.

DROP TABLE IF EXISTS test;
CREATE TABLE test (
    Col1 STRING,
    Col2 STRING,
    Col3 STRING,
    Col4 STRING
  );
INSERT INTO TABLE test SELECT
  regexp_replace(fields[0], "\\[QUOTEDCOMMA\\]", ","),  -- Step #4
  regexp_replace(fields[1], "\\[QUOTEDCOMMA\\]", ","),  -- Step #4
  regexp_replace(fields[2], "\\[QUOTEDCOMMA\\]", ","),  -- Step #4
  regexp_replace(fields[3], "\\[QUOTEDCOMMA\\]", ",")   -- Step #4
FROM (
  SELECT split(  -- Step #2 and #3
    regexp_replace(rawline, "\"([^,]*),([^,]*)\"", "$1[QUOTEDCOMMA]$2"),
    ',') AS fields
  FROM staging
) t;

This generates the following final table test:

hive> SELECT * FROM test;
OK
a   b,c     d        e
p   q       e,r      t
a   s       t,g      t
Time taken: 0.196 seconds, Fetched: 3 row(s)

In this sample implementation, the string [QUOTEDCOMMA] is being used as the artificial placeholder for a comma that is found between quotation marks. The choice was completely arbitrary, and in practice if you go this route you will want to make sure your placeholder does not naturally occur within your data.



回答2:

The default text serde in Hive (LazySimple) doesn't support proper CSV semantics. The good news is that in the latest version of hive to date - 0.14.0 - there's a new serde that takes care of exactly that. If you happen to use this version, you can use the CSV serde and specify a double quote as the quote character - leading to proper parsing of the data, as you specify in your question.

Information about the serde and how to use it: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde



标签: hadoop hive