Hadoop/Hive newbie here. I am trying to use data stored in a custom text-based format with Hive. My understanding is that you can write either a custom FileFormat or a custom SerDe class to do that. Is that the case, or am I misunderstanding it? And what are some general guidelines on which option to choose when? Thanks!
Basically you need to understand the difference between when to modify the SerDe and when to modify the file format.
From the official documentation: Hive SerDe

What is a SerDe?
1. SerDe is a short name for "Serializer and Deserializer."
2. Hive uses SerDe (and FileFormat) to read and write table rows.
3. HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
4. Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
So points 3 and 4 clearly show the difference. You need a custom file format (input/output) when you want to read a record in a different way than usual (where records are separated by '\n'), and you need a custom SerDe when you want to interpret the read records in a custom way.
Let's take the commonly used JSON format as an example.
Scenario 1: Let's say you have an input JSON file where one line contains one JSON record. Here you only need a custom SerDe to interpret the read record the way you want; there is no need for a custom input format, since one line is one record.
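A minimal sketch of what such a SerDe might look like against the classic serde2 API. The class name MyJsonSerDe, the string-only columns, and the extractField helper are placeholders for illustration, not anything from the question:

```java
package example;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MyJsonSerDe extends AbstractSerDe {

  private List<String> columnNames;
  private ObjectInspector rowOI;
  private final List<Object> row = new ArrayList<>();

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Column names come from the table DDL via the table properties.
    columnNames = Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
    // For simplicity every column is exposed as STRING; a real SerDe would
    // also honour LIST_COLUMN_TYPES.
    List<ObjectInspector> columnOIs = new ArrayList<>();
    for (int i = 0; i < columnNames.size(); i++) {
      columnOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, columnOIs);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    String line = blob.toString();
    row.clear();
    // Parse the single-line JSON record here (e.g. with a JSON library) and
    // add one value per declared column, in column order.
    for (String col : columnNames) {
      row.add(extractField(line, col)); // hypothetical helper
    }
    return row;
  }

  private String extractField(String json, String field) {
    // Placeholder: real code would use an actual JSON parser.
    return null;
  }

  @Override
  public ObjectInspector getObjectInspector() {
    return rowOI;
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return Text.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("Write path not implemented in this sketch");
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null;
  }
}
```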
Scenario 2: Now, if you have an input file where one JSON record spans multiple lines and you want to read it as is, then you should first write a custom input format that reads in one whole JSON record, and that record then goes to the custom SerDe.
Depends on what you're getting from your text file.
You can write a custom RecordReader to parse the text log file and return records the way you want; the InputFormat class does that job for you. You will use this jar to create the Hive table and load the data into that table.
Talking about SerDe, I use it a little differently. I use both an InputFormat and a SerDe: the former to parse the actual data, and the latter to keep the metadata that represents the actual data stable. Why do I do that? I want to create exactly the appropriate columns (not more, not fewer) in the Hive table for each row of my log file, and I think a SerDe is the perfect solution for that.
Eventually I map those two to create a final table if I want, or keep those tables as they are so that I can join them in queries.
I like the explanation in this Cloudera blog post:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
If you're using Hive, write a serde. See these examples: https://github.com/apache/hive/tree/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2
Note that this interface is Hive-specific. If you want to use your custom file format for regular Hadoop jobs, you'll have to implement a separate interface (I'm not totally sure which one).
If you already know how to deserialize data in another language, you could just write a streaming job (using any language) and use your existing libraries.
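For instance, a rough sketch of what the Hive side of such a streaming job might look like; parse_records.py, raw_table, and the output columns are all hypothetical:

```sql
-- parse_records.py reads raw lines on stdin and writes tab-separated
-- columns on stdout; it can reuse whatever parsing library you already have.
ADD FILE parse_records.py;

SELECT TRANSFORM (raw_line)
       USING 'python parse_records.py'
       AS (col1 STRING, col2 INT)
FROM raw_table;
```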
Hope that helps
I figured it out. I did not have to write a SerDe after all; I wrote a custom InputFormat (extends org.apache.hadoop.mapred.TextInputFormat) which returns a custom RecordReader (implements org.apache.hadoop.mapred.RecordReader<K, V>). The RecordReader implements the logic to read and parse my files and returns tab-delimited rows. With that I declared my table as:
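(The original CREATE TABLE statement isn't reproduced here; a minimal sketch of such a DDL, in which the table name, columns, and input format class are placeholders, might look like this:)

```sql
CREATE TABLE my_table (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT  'com.example.MyTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```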
This uses a native SerDe. Also, you are required to specify an output format when using a custom input format, so I chose one of the built-in output formats.
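For reference, a minimal sketch of that InputFormat/RecordReader combination against the old mapred API; the class names and the parsing logic (here just mapping commas to tabs) are placeholders:

```java
package example;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class MyTextInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new MyRecordReader(new LineRecordReader(job, (FileSplit) split));
  }

  // Wraps the standard line reader and rewrites each line into
  // tab-delimited fields that Hive's default SerDe can consume.
  public static class MyRecordReader implements RecordReader<LongWritable, Text> {

    private final LineRecordReader reader;
    private final Text line = new Text();

    public MyRecordReader(LineRecordReader reader) {
      this.reader = reader;
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
      if (!reader.next(key, line)) {
        return false;
      }
      // Hypothetical parsing: turn the custom format into tab-separated columns.
      value.set(parse(line.toString()));
      return true;
    }

    private String parse(String raw) {
      // Placeholder for the real parsing logic of the custom format.
      return raw.replace(',', '\t');
    }

    @Override
    public LongWritable createKey() {
      return reader.createKey();
    }

    @Override
    public Text createValue() {
      return reader.createValue();
    }

    @Override
    public long getPos() throws IOException {
      return reader.getPos();
    }

    @Override
    public float getProgress() throws IOException {
      return reader.getProgress();
    }

    @Override
    public void close() throws IOException {
      reader.close();
    }
  }
}
```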