handling newline character in hive

2019-01-26 19:59发布

问题:

I have created a table in hive as

Create table(id int, Description String)  

My data looks something as follows :

 
1|This will return corrupt data since there is a ',' in the first string.
     some text
     Change the data  
2|There is prob in reading data 
    sometext

After the data is loaded into hive since the default line terminator is \n, the description column cannot be read by hive, Hence it displays a NULL value. Can anyone suggest how to handle newline before loading into hive.

回答1:

I know this question is old, but you have a couple of options. You can't control this with FIELDS TERMINATED BY, because that only controls what terminates the fields, not the records. Records in Hive are hard-coded to be terminated by the newline character (even though there is a LINES TERMINATED BY clause, it is not implemented).

  1. Write a custom InputFormat that uses a RecordReader that understands non-newline delimited records. Look at the code for LineReader/LineRecordReader and TextInputFormat.
  2. Use a format other than text/ASCII, like Parquet. I would recommend this regardless, as text is probably the worst format you can store data in anyway.


回答2:

try adding the below property in hive-site.xml or you can just try for temporary hive session level.

hive.query.result.fileformat=SequenceFile



回答3:

By default hive takes in NEWLINE ('\N') as delimiter . You can change the delimiter using:

    ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";


标签: hadoop hive