Hive load CSV with commas in quoted fields

Posted 2019-07-04 06:19

I am trying to load a CSV file into a Hive table like this:

CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;    


The CSV file is delimited by commas (,) and looks like this:

1, "some text, with comma in it", 123, "more text"

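This returns corrupted data, because the first string contains a ','.
Is there a way to set a text delimiter, or to make Hive ignore the ',' inside strings?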

I can't change the delimiter of the CSV, because it is pulled from an external source.

Answer 1:

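The problem is that Hive doesn't handle quoted text. You either need to pre-process the data by changing the delimiter between the fields (e.g. with a Hadoop streaming job), or you can try a custom CSV SerDe that uses OpenCSV to parse the files.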



Answer 2:

If you can re-create or re-parse your input data, you can specify an escape character in the CREATE TABLE:

ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';

This will accept the following line as 4 fields:

1,some text\, with comma in it,123,more text
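
Combined with the table from the question, the escape clause might look like the sketch below. The table name mytable_escaped is hypothetical, and the sketch assumes the file has already been rewritten so that the quotes are gone and embedded commas are backslash-escaped, as in the sample line above.

```sql
-- Sketch: the question's table with backslash declared as the escape character.
-- mytable_escaped is a hypothetical name, not part of the original answer.
CREATE TABLE mytable_escaped
(
  num1  INT,
  text1 STRING,
  num2  INT,
  text2 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
STORED AS TEXTFILE;

-- Assumes /data.csv is the pre-processed file (commas inside text written as "\,").
LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable_escaped;
```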


Answer 3:

As of Hive 0.14, the CSV SerDe is a standard part of the Hive install:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde )
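
For the table in the question, a fuller declaration might look like the sketch below. The names mytable_csv and mytable_typed are hypothetical. The three SERDEPROPERTIES are written out explicitly with the values the SerDe uses by default. Because OpenCSVSerde reads every column as STRING, the numeric columns are declared as STRING and cast in a view.

```sql
-- Sketch: the question's table declared with the built-in OpenCSVSerde (Hive 0.14+).
-- mytable_csv is a hypothetical name. The SerDe treats every column as STRING,
-- so num1 and num2 are declared STRING here and cast in the view below.
CREATE TABLE mytable_csv
(
  num1  STRING,
  text1 STRING,
  num2  STRING,
  text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable_csv;

-- Hypothetical view that restores the numeric types; trim() guards against
-- stray spaces around the values.
CREATE VIEW mytable_typed AS
SELECT CAST(trim(num1) AS INT) AS num1,
       text1,
       CAST(trim(num2) AS INT) AS num2,
       text2
FROM mytable_csv;
```

Querying mytable_typed rather than mytable_csv keeps the quoting logic in the SerDe and the type handling in one place.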



Answer 4:

Keep the delimiter in single quotes and it will work.

ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

This will work.



Answer 5:

Add a backslash in FIELDS TERMINATED BY '\;'

For example:

CREATE  TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS 
SELECT a.tran_uuid, a.cust_id, a.risk_flag, a.lookback_start_date, a.lookback_end_date,
       b.scn_name, b.alerted_risk_category,
       CASE WHEN b.activity_id IS NOT NULL THEN 1 ELSE 0 END AS Alert_Flag
FROM scn1_rcc1_agg AS a
LEFT OUTER JOIN scenario_activity_alert AS b ON a.tran_uuid = b.activity_id;

I have tested it, and it worked.



Source: Hive load CSV with commas in quoted fields