Hadoop/Hive newbie here. I am trying to use data stored in a custom text-based format with Hive. My understanding is that you can write either a custom FileFormat or a custom SerDe class to do that. Is that the case, or am I misunderstanding it? And what are some general guidelines on which option to choose when? Thanks!
Basically you need to understand the difference between when to modify the SerDe and when to modify the file format.
From the official documentation: Hive SerDe

What is a SerDe?
1. SerDe is a short name for "Serializer and Deserializer."
2. Hive uses SerDe (and FileFormat) to read and write table rows.
3. HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
4. Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
So points 3 and 4 clearly show the difference. You need a custom file format (input/output) when you want to read a record in a different way than usual (where records are separated by '\n'), and you need a custom SerDe when you want to interpret the read records in a custom way.
Let's take the commonly used JSON format as an example.
Scenario 1: Let's say you have an input JSON file where one line contains one JSON record. Here you only need a custom SerDe to interpret the read record the way you want; there is no need for a custom input format, since one line is one record.
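A minimal sketch of what such a SerDe might look like against the classic serde2 API. The class name MyJsonSerDe, the string-only columns, and the extractField helper are placeholders for illustration, not anything from the question:

```java
package example;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MyJsonSerDe extends AbstractSerDe {

  private List<String> columnNames;
  private ObjectInspector rowOI;
  private final List<Object> row = new ArrayList<>();

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Column names come from the table DDL via the table properties.
    columnNames = Arrays.asList(tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
    // For simplicity every column is exposed as STRING; a real SerDe would
    // also honour LIST_COLUMN_TYPES.
    List<ObjectInspector> columnOIs = new ArrayList<>();
    for (int i = 0; i < columnNames.size(); i++) {
      columnOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    rowOI = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, columnOIs);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    String line = blob.toString();
    row.clear();
    // Parse the single-line JSON record here (e.g. with a JSON library) and
    // add one value per declared column, in column order.
    for (String col : columnNames) {
      row.add(extractField(line, col)); // hypothetical helper
    }
    return row;
  }

  private String extractField(String json, String field) {
    // Placeholder: real code would use an actual JSON parser.
    return null;
  }

  @Override
  public ObjectInspector getObjectInspector() {
    return rowOI;
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return Text.class;
  }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("Write path not implemented in this sketch");
  }

  @Override
  public SerDeStats getSerDeStats() {
    return null;
  }
}
```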
Scenario 2: Now, if you have an input file where one JSON record spans multiple lines and you want to read it as is, then you should first write a custom input format that reads in one whole JSON record, and that record then goes to the custom SerDe.
Depends on what you're getting from your text file.
You can write a custom RecordReader to parse the text log file and return records the way you want; the InputFormat class does that job for you. You will use this jar to create the Hive table and load the data into that table.
Talking about SerDe, I use it a little differently. I use both an InputFormat and a SerDe: the former to parse the actual data, and the latter to keep the metadata that represents the actual data stable. Why do I do that? I want to create exactly the appropriate columns (not more, not fewer) in the Hive table for each row of my log file, and I think a SerDe is the perfect solution for that.
Eventually I map those two to create a final table if I want, or keep those tables as they are so that I can join them in queries.
I like the explanation in this Cloudera blog post:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
If you're using Hive, write a serde. See these examples: https://github.com/apache/hive/tree/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2
Note that this interface is Hive-specific. If you want to use your custom file format for regular Hadoop jobs, you'll have to implement a separate interface (I'm not totally sure which one).
If you already know how to deserialize data in another language, you could just write a streaming job (using any language) and use your existing libraries.
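For instance, a rough sketch of what the Hive side of such a streaming job might look like; parse_records.py, raw_table, and the output columns are all hypothetical:

```sql
-- parse_records.py reads raw lines on stdin and writes tab-separated
-- columns on stdout; it can reuse whatever parsing library you already have.
ADD FILE parse_records.py;

SELECT TRANSFORM (raw_line)
       USING 'python parse_records.py'
       AS (col1 STRING, col2 INT)
FROM raw_table;
```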
Hope that helps
I figured it out. I did not have to write a SerDe after all; I wrote a custom InputFormat (extends org.apache.hadoop.mapred.TextInputFormat) which returns a custom RecordReader (implements org.apache.hadoop.mapred.RecordReader<K, V>). The RecordReader implements the logic to read and parse my files and returns tab-delimited rows. With that I declared my table as:
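(The original CREATE TABLE statement isn't reproduced here; a minimal sketch of such a DDL, in which the table name, columns, and input format class are placeholders, might look like this:)

```sql
CREATE TABLE my_table (
  col1 STRING,
  col2 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT  'com.example.MyTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```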
This uses a native SerDe. Also, you are required to specify an output format when using a custom input format, so I chose one of the built-in output formats.
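For reference, a minimal sketch of that InputFormat/RecordReader combination against the old mapred API; the class names and the parsing logic (here just mapping commas to tabs) are placeholders:

```java
package example;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class MyTextInputFormat extends TextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new MyRecordReader(new LineRecordReader(job, (FileSplit) split));
  }

  // Wraps the standard line reader and rewrites each line into
  // tab-delimited fields that Hive's default SerDe can consume.
  public static class MyRecordReader implements RecordReader<LongWritable, Text> {

    private final LineRecordReader reader;
    private final Text line = new Text();

    public MyRecordReader(LineRecordReader reader) {
      this.reader = reader;
    }

    @Override
    public boolean next(LongWritable key, Text value) throws IOException {
      if (!reader.next(key, line)) {
        return false;
      }
      // Hypothetical parsing: turn the custom format into tab-separated columns.
      value.set(parse(line.toString()));
      return true;
    }

    private String parse(String raw) {
      // Placeholder for the real parsing logic of the custom format.
      return raw.replace(',', '\t');
    }

    @Override
    public LongWritable createKey() {
      return reader.createKey();
    }

    @Override
    public Text createValue() {
      return reader.createValue();
    }

    @Override
    public long getPos() throws IOException {
      return reader.getPos();
    }

    @Override
    public float getProgress() throws IOException {
      return reader.getProgress();
    }

    @Override
    public void close() throws IOException {
      reader.close();
    }
  }
}
```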