The data is in the following format
a,"b,c",d, e
p,q,"e,r", t
a,s,"t,g", t
I wanted to create a Hive table
Col1, Col2, Col3 , Col4
a , b,c , d , e
p , q , e,r , t
a , s , t,g , t
As you see above, if the data is encapsulated in double quotes, the comma in between should not be considered in creating the table data. If I use the default SerDe, the double quotes are ignored and b,c is considered as two separate columns.
How do I ensure that the double quote will ignore the comma in between two elements if encapsulated in double quotes
The default text serde in Hive (LazySimple) doesn't support proper CSV semantics. The good news is that in the latest version of hive to date - 0.14.0 - there's a new serde that takes care of exactly that. If you happen to use this version, you can use the CSV serde and specify a double quote as the quote character - leading to proper parsing of the data, as you specify in your question.
Information about the serde and how to use it: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
If it is possible and feasible, I'd first recommend you explore seeing if the input data can be sanitized in such a way to use something besides commas for field delimiters. It is always dicey to use a delimiting pattern that may naturally occur in your data.
But if that is not possible, then there is this regular expression-based way to detect quoted commas:
As a contrived concrete example, I loaded up the following single-column staging table with your data (step #1):
The following query then generates the final target table.
This generates the following final table
test
:In this sample implementation, the string
[QUOTEDCOMMA]
is being used as the artificial placeholder for a comma that is found between quotation marks. The choice was completely arbitrary, and in practice if you go this route you will want to make sure your placeholder does not naturally occur within your data.