Using FileFormat vs. SerDe to read custom text files

Posted 2019-03-20 05:36

Question:

Hadoop/Hive newbie here. I am trying to use data stored in a custom text-based format with Hive. My understanding is you can either write a custom FileFormat or a custom SerDe class to do that. Is that the case or am I misunderstanding it? And what are some general guidelines on which option to choose when? Thanks!

Answer 1:

I figured it out. I did not have to write a SerDe after all; I wrote a custom InputFormat (extending org.apache.hadoop.mapred.TextInputFormat) that returns a custom RecordReader (implementing org.apache.hadoop.mapred.RecordReader<K, V>). The RecordReader contains the logic to read and parse my files, and it returns tab-delimited rows.
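
For reference, here is a minimal sketch of that approach using the old org.apache.hadoop.mapred API. The class name CustomFileInputFormat, the wrapped LineRecordReader, and the pipe-to-tab parsing step are only illustrative assumptions on my part; the Hadoop interfaces and method signatures are the real ones.

package namespace;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class CustomFileInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(split.toString());
        return new CustomRecordReader(job, (FileSplit) split);
    }

    // Wraps the standard line reader and rewrites each line into '\t'-separated fields.
    public static class CustomRecordReader implements RecordReader<LongWritable, Text> {
        private final LineRecordReader lineReader;

        public CustomRecordReader(JobConf job, FileSplit split) throws IOException {
            lineReader = new LineRecordReader(job, split);
        }

        public boolean next(LongWritable key, Text value) throws IOException {
            Text raw = new Text();
            if (!lineReader.next(key, raw)) {
                return false;
            }
            // Hypothetical parsing step: convert the custom format (here assumed to be
            // '|'-separated) into tab-delimited text so the DELIMITED row format below
            // can split it into columns.
            value.set(raw.toString().replace('|', '\t'));
            return true;
        }

        public LongWritable createKey() { return lineReader.createKey(); }
        public Text createValue() { return new Text(); }
        public long getPos() throws IOException { return lineReader.getPos(); }
        public float getProgress() throws IOException { return lineReader.getProgress(); }
        public void close() throws IOException { lineReader.close(); }
    }
}

Once this is compiled into a jar and made available to Hive (for example with ADD JAR), the fully qualified class name can be referenced in the STORED AS INPUTFORMAT clause of the table definition below.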

With that I declared my table as

CREATE TABLE t2 (
  field1 STRING,
  ..
  fieldNN FLOAT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'namespace.CustomFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

This uses the native SerDe. Also, an output format must be specified when using a custom input format, so I chose one of the built-in output formats.



Answer 2:

Basically, you need to understand the difference between when to customize the SerDe and when to customize the file format.

From the official documentation: Hive SerDe

What is a SerDe?

1. SerDe is a short name for "Serializer and Deserializer."
2. Hive uses SerDe (and FileFormat) to read and write table rows.
3. HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
4. Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files

Points 3 and 4 clearly show the difference. You need a custom file format (input/output) when you want to read a record in a different way than usual (where records are separated by '\n'), and you need a custom SerDe when you want to interpret the records you read in a custom way.

Let's take the example of a commonly used format, JSON.

Scenario 1: Let's say you have an input JSON file where one line contains one JSON record. Here you just need a custom SerDe to interpret the record you read in the way you want. There is no need for a custom input format, since one line is one record.
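
As a hedged sketch of what such a SerDe might look like for Scenario 1: this is not Hive's built-in JsonSerDe; the class name SimpleJsonSerDe is made up, every column is treated as a string for brevity, it assumes the org.json library is on the classpath, and the AbstractSerDe method signatures can differ slightly between Hive versions.

package namespace;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.json.JSONObject;

public class SimpleJsonSerDe extends AbstractSerDe {

    private List<String> columnNames;
    private ObjectInspector rowOI;

    @Override
    public void initialize(Configuration conf, Properties tbl) throws SerDeException {
        // Column names come from the table definition (the "columns" table property).
        columnNames = Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        for (int i = 0; i < columnNames.size(); i++) {
            fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        }
        rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, fieldOIs);
    }

    @Override
    public Object deserialize(Writable blob) throws SerDeException {
        // One line of text is one JSON record; pull out each declared column by name.
        try {
            JSONObject json = new JSONObject(blob.toString());
            List<Object> row = new ArrayList<Object>(columnNames.size());
            for (String col : columnNames) {
                row.add(json.has(col) ? json.get(col).toString() : null);
            }
            return row;
        } catch (Exception e) {
            throw new SerDeException("Could not parse record: " + blob, e);
        }
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return rowOI;
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        return Text.class;
    }

    @Override
    public Writable serialize(Object obj, ObjectInspector objInspector) throws SerDeException {
        throw new SerDeException("This sketch only supports reading (deserialization).");
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null;
    }
}

Such a class would then be referenced with ROW FORMAT SERDE 'namespace.SimpleJsonSerDe' in the table definition.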

Scenario 2: If you have an input file where one JSON record spans multiple lines and you want to read it as is, then you should first write a custom input format that reads in one whole JSON record; that record will then go to the custom SerDe. A rough sketch of such a reader follows.
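
This record reader keeps appending lines until the braces of one JSON object balance, then emits the whole object as a single record; it could be returned from an input format like the one sketched in Answer 1. The class name and the brace-counting heuristic are assumptions for illustration only: the heuristic ignores braces inside string values, and it assumes a record never crosses a split boundary (e.g. the files are marked non-splittable).

package namespace;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class MultiLineJsonRecordReader implements RecordReader<LongWritable, Text> {

    private final LineRecordReader lineReader;

    public MultiLineJsonRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        StringBuilder record = new StringBuilder();
        Text line = new Text();
        int depth = 0;
        while (lineReader.next(key, line)) {
            String s = line.toString();
            record.append(s);
            for (int i = 0; i < s.length(); i++) {
                if (s.charAt(i) == '{') depth++;
                if (s.charAt(i) == '}') depth--;
            }
            // Emit once we have seen at least one '{' and all braces are closed again.
            if (record.indexOf("{") >= 0 && depth == 0) {
                value.set(record.toString());
                return true;
            }
        }
        return false;  // end of split
    }

    public LongWritable createKey() { return lineReader.createKey(); }
    public Text createValue() { return new Text(); }
    public long getPos() throws IOException { return lineReader.getPos(); }
    public float getProgress() throws IOException { return lineReader.getProgress(); }
    public void close() throws IOException { lineReader.close(); }
}

The complete JSON object it emits then reaches the custom SerDe exactly as described above.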



Answer 3:

It depends on what you're getting from your text file.

You can write a custom record reader to parse the text log file and return records the way you want; the input format class does that job for you. You then use that jar to create the Hive table and load the data into it.

As for the SerDe, I use it a little differently. I use both an InputFormat and a SerDe: the former to parse the actual data, and the latter to keep the metadata that represents that data stable. Why do I do that? I want to create exactly the appropriate columns (no more, no fewer) in the Hive table for each row of my log file, and I think a SerDe is the perfect solution for that.

Eventually I map those two to create a final table if I want, or keep the tables as they are so that I can query them with joins.

I like the explanation in this Cloudera blog post:

http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/



Answer 4:

If you're using Hive, write a SerDe. See these examples: https://github.com/apache/hive/tree/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2

Note that this interface is Hive-specific. If you want to use your custom file format for regular Hadoop jobs, you'll have to implement a separate interface (I'm not totally sure which one).

If you already know how to deserialize data in another language, you could just write a streaming job (using any language) and use your existing libraries.

Hope that helps



Tags: hive