How to handle fields enclosed within quotes (CSV) when importing data from S3 into DynamoDB using EMR/Hive

Posted 2019-02-02 13:32

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed within double quotes and separated by commas. While creating the external table in Hive, I can specify the delimiter as a comma, but how do I specify that the fields are enclosed within quotes?

If I don't specify this, I see that the values in DynamoDB are populated with the surrounding double quotes ("value"), which seems wrong.

I am using the following command to create the external table. Is there a way to specify that the fields are enclosed within double quotes?

CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' LOCATION 's3://emrTest/folder';

Any suggestions would be appreciated. Thanks, Jitendra

7 answers
Aperson · 2019-02-02 13:56

Hive doesn't support quoted strings right out of the box. There are two approaches to solving this:

  1. Use a different field separator (e.g. a pipe).
  2. Write a custom InputFormat based on OpenCSV.

The faster (and arguably more sane) approach is to modify your initial export process to use a different delimiter so you can avoid quoted strings altogether. This way you can tell Hive to use an external table with a tab or pipe delimiter:

CREATE TABLE foo (
  col1 INT,
  col2 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
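
With that layout, a row in the re-exported file would look something like the hypothetical sample below (the exact values depend on your export job), and the embedded comma no longer needs any quoting:

1|some value, with an embedded comma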
疯言疯语 · 2019-02-02 13:57

Use the csv-serde-0.9.1.jar file in your Hive query; see http://illyayalovyy.github.io/csv-serde/

add jar /path/to/jar_file;

CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\""
)
STORED AS TEXTFILE
LOCATION 's3://emrTest/folder'
TBLPROPERTIES ("skip.header.line.count" = "1");  -- skip the header row, if the file has one
我欲成王，谁敢阻挡 · 2019-02-02 14:04

Hive now includes an OpenCSVSerde which will properly parse quoted fields without adding additional jars or error-prone and slow regexes.

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
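
For example, applied to the table from the question, the full DDL would look something like this (a sketch; it assumes Hive 0.14 or later, where OpenCSVSerde ships with Hive, and reuses the column names and S3 location from the question):

CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",   -- field delimiter in the source CSV
  "quoteChar"     = "\""   -- fields are wrapped in double quotes
)
STORED AS TEXTFILE
LOCATION 's3://emrTest/folder';

Note that OpenCSVSerde reads every column as STRING, so cast afterwards if you need other types.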

We Are One · 2019-02-02 14:06

If you're stuck with the CSV file format, you'll have to use a custom SerDe; there is existing work based on the opencsv library.

But if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite the files to escape any embedded commas with a single escape character, e.g. '\', which can be specified in the ROW FORMAT with ESCAPED BY:

CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LOCATION 's3://emrTest/folder';
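
With that rewrite, a row in the source file carries its embedded comma escaped instead of quoted, e.g. (a hypothetical sample line):

val1,val2 with an escaped \, comma,val3,val4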
淡お忘 · 2019-02-02 14:06

The following code solved the same type of problem:

CREATE TABLE TableRowCSV2(
    CODE STRING,
    PRODUCTCODE STRING,
    PRICE STRING
)
COMMENT 'row data csv'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\""
)
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count" = "1");
chillily · 2019-02-02 14:10

I was also stuck with the same issue, as my fields are enclosed in double quotes and separated by semicolons (;). My table name is employee1.

I searched around and found a solution that works.

We have to use a SerDe for this. Download the SerDe jar from this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar

Then follow the steps below at the Hive prompt:

add jar path/to/csv-serde.jar;

create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
  "separatorChar" = ";",
  "quoteChar"     = "\""
)
stored as textfile;

Then load the data from your path with the query below:

load data local inpath 'path/xyz.csv' into table employee1;

Then run:

select * from employee1;

Now you will see the magic. Thanks.
