I am just getting started with Hadoop/Pig/Hive on the Cloudera platform and have questions on how to effectively load data for querying.
I currently have ~50 GB of IIS logs loaded into HDFS with the following directory structure:
/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/
/user/oi/raw_iis/Webserver2/Org/SubOrg/W3SVC1888303555/
/user/oi/raw_iis/Webserver3/Org/SubOrg/W3SVC1056245683/
etc.
I would like to load all the logs into a Hive table.
I have two issues/questions:
1.
My first issue is that some of the webservers may not have been configured correctly, so their IIS logs will not contain all columns. These incorrect logs need additional processing to map the available columns onto the schema that contains all columns.
The data is space delimited, and when not all columns are enabled, the log includes only the enabled columns. Hive can't automatically insert nulls, since the data gives no indication of which columns are missing. I need to be able to map the columns that are available in each log to the full schema.
Example good log:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/27.0.1453.116+Safari/537.36
Example log with missing columns (cs-method and useragent):
#Fields: date time s-ip cs-uri-stem
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232
The log with missing columns needs to be mapped to the full schema like this:
#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null
How can I map these enabled fields to a schema that includes all possible columns, inserting a blank/null/- token for the fields that are missing? Is this something I could handle with a Pig script?
2.
How can I define my Hive tables to include information from the HDFS path, namely Org and SubOrg in my directory structure example, so that it is queryable in Hive? I am also unsure how to properly import data from the many directories into a single Hive table.
I was able to solve both of my issues with a Pig UDF (user-defined function).
All I really had to do was add some logic to handle the IIS headers that start with #. Below are the snippets from getNext() that I used; everything else is the same as mr2ert's example code.
See the values[0].equals("#Fields:") parts.
To include information from the file path, I added the following to the LoadFunc UDF that I used to solve question 1: in the prepareToRead override, grab the file path and store it in a member variable.
Then within getNext() I could add the path to the output tuple.
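The core of that getNext() logic — reading the #Fields: header and mapping each record onto the full schema, with a null token for disabled columns — is language-agnostic. Here is a minimal sketch in Python; the actual UDF does the equivalent in Java inside Pig, and the schema list and "-" token below are illustrative assumptions, not the original code:

```python
# Sketch of the header-aware mapping done inside the LoadFunc's getNext().
# FULL_SCHEMA and the "-" null token are illustrative assumptions.
FULL_SCHEMA = ["date", "time", "s-ip", "cs-method", "cs-uri-stem", "useragent"]

def map_log_lines(lines, null_token="-"):
    """Map space-delimited IIS log records onto FULL_SCHEMA.

    A line starting with '#Fields:' declares which columns are enabled;
    subsequent records are padded with null_token for disabled columns.
    Other '#' lines (IIS comments) are skipped.
    """
    enabled = FULL_SCHEMA  # assume all columns until a header says otherwise
    for line in lines:
        values = line.rstrip("\n").split(" ")
        if values[0] == "#Fields:":
            enabled = values[1:]          # columns actually present in this log
            continue
        if values[0].startswith("#"):
            continue                      # other IIS comment lines
        record = dict(zip(enabled, values))
        yield [record.get(col, null_token) for col in FULL_SCHEMA]
```

For the example log above with cs-method and useragent disabled, each record comes out padded to the six-column schema with "-" in the missing positions.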
First, provide sample data for better help.
How can I map these enabled fields to a schema that includes all possible columns, inserting blank/null/- token for fields that were missing?
If your file has a consistent delimiter, you can use Hive: Hive automatically inserts nulls for columns missing at the end of a row, provided the delimiter does not also appear inside your data.
Is this something I could handle with a Pig script?
If the fields are delimited, you can use Hive; otherwise you can go for MapReduce or Pig.
How can I include information from the HDFS path, namely Org and SubOrg in my directory structure example, so that it is queryable in Hive?
It seems you are new to Hive. Before querying, you have to create a table that specifies information like the data's path, delimiter, and schema.
Is this a good candidate for partitioning?
You can partition on date if you wish.
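Putting the table definition and partitioning together, here is a sketch in Hive DDL. The table name, column types, and the choice of Org/SubOrg as partition keys are illustrative assumptions; partitioning on date would additionally require the data to be laid out (or loaded) into per-date directories:

```sql
-- Hypothetical external table over the raw IIS logs (column names/types assumed).
CREATE EXTERNAL TABLE iis_log (
  log_date    STRING,
  log_time    STRING,
  s_ip        STRING,
  cs_method   STRING,
  cs_uri_stem STRING,
  useragent   STRING
)
PARTITIONED BY (org STRING, suborg STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION '/user/oi/raw_iis';

-- Register each directory as a partition so Org/SubOrg become queryable columns:
ALTER TABLE iis_log ADD PARTITION (org = 'Org', suborg = 'SubOrg')
LOCATION '/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793';
```

With the partitions registered, a query such as `SELECT ... FROM iis_log WHERE org = 'Org'` reads only the matching directories.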