Hive External Table Skip First Row

2019-01-10 10:02发布

问题:

I am using Cloudera's version of Hive and trying to create an external table over a csv file that contains the column names in the first column. Here is the code that I am using to do that.

CREATE EXTERNAL TABLE Test ( 
  RecordId int, 
  FirstName string, 
  LastName string 
) 
ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' 
WITH SerDeProperties (  
  "separatorChar" = ","
) 
STORED AS TEXTFILE 
LOCATION '/user/File.csv'

Sample Data

RecordId,FirstName,LastName
1,"John","Doe"
2,"Jane","Doe"

Can anyone help me with how to skip the first row or do I need to add an intermediate step?

回答1:

Header rows in data are a perpetual headache in Hive. Short of modifying the Hive source, I believe you can't get away without an intermediate step. (Edit: This is no longer true, see update below)

Unfortunately, that answers you question. I'll throw in some ideas for the intermediate step for completeness.

You can get away without an extra step in your data load if you are willing to filter out the header row on every query that touches the table. Unfortunately this adds an extra set just about everywhere else. And you will have to get clever/messy when the header row violates your schema. If you go with this approach, you might consider writing a custom SerDe that makes this row easier to filter. Unfortunately, SerDe's cannot remove the row entirely (or that might form a possible solution), they must return something like null. I've never seen this approach taken in practice to deal with header rows since it makes reading a pain, and reading tends to be much more common than writing. It might have a place if you are dealing with one-of tables or if the header row is just one row among many malformed rows.

You could do this filtering once with variations on deleting that first row in data load. A WHERE clause in an INSERT statement would do it. You could use utilities like sed to get rid of it. I've seen both approaches taken. There are trade-offs between which approach you take and neither is the one true way to deal with header rows. Unfortunately, both these approaches take time and require temporary duplication of the data. If you absolutely need the header row for another application, the duplication would be permanent.

Update:

From Hive v0.13.0, you can use skip.header.line.count. You could also specify the same while creating the table. For example:

create external table testtable (name string, message string)
row format delimited 
fields terminated by '\t' 
lines terminated by '\n' 
location '/testtable'
tblproperties ("skip.header.line.count"="1");


回答2:

While you have your answer from Daniel, here are some customizations possible using OpenCSVSerde:

CREATE EXTERNAL TABLE `mydb`.`mytable`(
    `product_name` string,
    `brand_id` string,
    `brand` string,
    `color` string,
    `description` string,
    `sale_price` string)
PARTITIONED BY (
    `seller_id` string)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = '\t',
    'quoteChar' = '"',
    'escapeChar' = '\\')
STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
    'hdfs://namenode.com:port/data/mydb/mytable'
TBLPROPERTIES (
    'serialization.null.format' = '',
    'skip.header.line.count' = '1')

With this, you have total control over the separator, quote character, escape character, null handling and header handling.

Look here and here.



回答3:

Just append below property in your query and the first header or line int the record will not load or it will be skipped.

Try this

tblproperties ("skip.header.line.count"="1");


回答4:

I am not quite sure if it works with ROW FORMAT serde 'com.bizo.hive.serde.csv.CSVSerde' but I guess that it should be similar to ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.
In your case first row will be treated like normal row. But first field fails to be INT so all fields, for first row, will be set as NULL. You need only one intermediate step to fix it:

INSERT OVERWRITE TABLE Test
SELECT * from Test WHERE RecordId IS NOT NULL

Only one drawback is that your original csv file will be modified. I hope it helps. GL!



回答5:

I also struggled with this and found no way to tell hive to skip first row, like there is e.g. in Greenplum. So finally I had to remove it from the files. e.g. "cat File.csv | grep -v RecordId > File_no_header.csv"



回答6:

create external table table_name( 
Year int, 
Month int,
column_name data_type ) 
row format delimited fields terminated by ',' 
location '/user/user_name/example_data' TBLPROPERTIES('serialization.null.format'='', 'skip.header.line.count'='1');


回答7:

skip.header.line.count works but if you have some external tool accessing accessing this table , it will still see that actual data without skipping those lines



标签: hive Cloudera