可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed within double quotes and separated by comma.
While creating external table in hive, I am able to specify delimiter as comma but how do I specify that fields are enclosed within quotes?
If I don’t specify, I see that values in DynamoDB are populated within two double quotes ““value”” which seems to be wrong.
I am using following command to create external table. Is there a way to specify that fields are enclosed within double quotes?
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' LOCATION 's3://emrTest/folder';
Any suggestions would be appreciated.
Thanks
Jitendra
回答1:
If you're stuck with the CSV file format, you'll have to use a custom SerDe; and here's some work based on the opencsv libarary.
But, if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite to escape any embedded commas with a single escape character, e.g. '\', which can be specified within the ROW FORMAT with ESCAPED BY:
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LOCATION 's3://emrTest/folder';
回答2:
I was also stuck with the same issue as my fields are enclosed with double quotes and separated by semicolon(;). My table name is employee1.
So I have searched with links and I have found perfect solution for this.
We have to use serde for this. Please download serde jar using this link : https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
then follow below steps using hive prompt :
add jar path/to/csv-serde.jar;
create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\;",
"quoteChar" = "\"")
stored as textfile
;
and then load data from your given path using below query:
load data local inpath 'path/xyz.csv' into table employee1;
and then run :
select * from employee1;
Now you will see the magic. Thanks.
回答3:
Following code solved same type of problem
CREATE TABLE TableRowCSV2(
CODE STRING,
PRODUCTCODE STRING,
PRICE STRING
)
COMMENT 'row data csv'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
回答4:
Hive now includes an OpenCSVSerde
which will properly parse those quoted fields without adding additional jars or error prone and slow regex.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
回答5:
Hive doesn't support quoted strings right out of the box. There are two approaches to solving this:
- Use a different field separator (e.g. a pipe).
- Write a custom InputFormat based on OpenCSV.
The faster (and arguably more sane) approach is to modify your initial the export process to use a different delimiter so you can avoid quoted strings. This way you can tell Hive to use an external table with a tab or pipe delimiter:
CREATE TABLE foo (
col1 INT,
col2 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
回答6:
Use the csv-serde-0.9.1.jar
file in your hive query, see
http://illyayalovyy.github.io/csv-serde/
add jar /path/to/jar_file
Create external table emrS3_import_1(col1 string, col2 string, col3 string, col4 string) row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties
(
"separatorChar" = "\;",
"quoteChar" = "\"
) stored as textfile
tblproperties("skip.header.line.count"="1") ---to skip if have any header file
LOCATION 's3://emrTest/folder';
回答7:
There can be multiple solutions to this problem.
- Write custom SerDe class
- Use RegexSerde
- Remove escaped delimiter chars from data
Read more at http://grokbase.com/t/hive/user/117t2c6zhe/urgent-hive-not-respecting-escaped-delimiter-characters