I am trying to create a table from a CSV file which is saved in HDFS. The problem is that the CSV contains line breaks inside quoted fields. Example of a record in the CSV:
ID,PR_ID,SUMMARY
2063,1184,"This is problem field because consists line break
This is not new record but it is part of text of third column
"
I created the Hive table:
CREATE TEMPORARY EXTERNAL TABLE hive_database.hive_table
(
ID STRING,
PR_ID STRING,
SUMMARY STRING
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\""
)
stored as textfile
LOCATION '/path/to/hdfs/dir/csv'
tblproperties('skip.header.line.count'='1');
Then I tried to count the rows (the correct result should be 1):
SELECT COUNT(*) FROM hive_database.hive_table;
But the result is 4, which is incorrect. Do you have any idea how to solve it? Thanks all.
There is currently no way to handle multi-line CSV files in Hive directly. However, there are some workarounds:
- Produce a CSV with \n or \r\n inside quoted fields replaced by your own newline marker, such as <br>. You will then be able to load it in Hive. Afterwards, transform the query results by replacing the marker with a real newline again.
- Use Spark; it has a multi-line CSV reader. This works out of the box, although the CSV is then not read in a distributed way.
- Use another format such as Parquet, Avro, ORC, or SequenceFile instead of CSV. For example, you could use Sqoop to produce them from a JDBC database, or you could write your own program in Java or Python.
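The first workaround above can be sketched in Python: a small preprocessor that parses the CSV quote-aware and rewrites embedded newlines as a marker, so each record ends up on one physical line. The marker string and function name here are my own choices for illustration, not anything Hive prescribes.

```python
import csv
import io

# Arbitrary marker; pick something that cannot occur in your real data.
NEWLINE_MARKER = "<br>"

def flatten_csv(text):
    """Return the CSV text with newlines inside quoted fields replaced
    by NEWLINE_MARKER, so every record fits on a single line."""
    reader = csv.reader(io.StringIO(text))  # quote-aware parsing
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([
            field.replace("\r\n", NEWLINE_MARKER).replace("\n", NEWLINE_MARKER)
            for field in row
        ])
    return out.getvalue()

raw = ('ID,PR_ID,SUMMARY\n'
       '2063,1184,"This is problem field because consists line break\n'
       'This is not new record but it is part of text of third column\n"\n')
print(flatten_csv(raw))
```

After loading the flattened file into Hive, you can restore the text with something like `regexp_replace(SUMMARY, '<br>', '\n')` in your queries. For the Spark route, its CSV reader accepts a `multiLine` option (Spark 2.2+), e.g. `spark.read.option("multiLine", True).csv(path)`.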
I found the solution. You can define your own InputFormat. The DDL for the Hive table will then look like this (first you need to add your custom jar file):
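The original post does not reproduce the DDL; as an illustration only, such a definition typically looks like the sketch below, where the jar path and the class name `my.custom.CustomTextInputFormat` are hypothetical placeholders, and the output format is Hive's standard text output class.

```sql
-- Add the jar containing your custom InputFormat first (path is a placeholder)
ADD JAR /path/to/custom-input-format.jar;

CREATE EXTERNAL TABLE hive_database.hive_table
(
  ID STRING,
  PR_ID STRING,
  SUMMARY STRING
)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS
  INPUTFORMAT 'my.custom.CustomTextInputFormat'  -- hypothetical class name
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/path/to/hdfs/dir/csv'
TBLPROPERTIES ('skip.header.line.count'='1');
```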
How to create the custom input format is shown, for example, here: https://analyticsanvil.wordpress.com/2016/03/06/creating-a-custom-hive-input-format-and-record-reader-to-read-fixed-format-flat-files/