Selectively loading IIS log files into Hive

Posted 2019-08-22 04:57

I am just getting started with Hadoop/Pig/Hive on the Cloudera platform and have questions about how to load data effectively for querying.

I currently have ~50 GB of IIS logs loaded into HDFS with the following directory structure:

/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/
/user/oi/raw_iis/Webserver2/Org/SubOrg/W3SVC1888303555/
/user/oi/raw_iis/Webserver3/Org/SubOrg/W3SVC1056245683/

etc

I would like to load all the logs into a Hive table.

I have two issues/questions:

1.

My first issue is that some of the webservers may not have been configured correctly and will have IIS logs without all columns. These incorrect logs need additional processing to map the available columns in the log to the schema that contains all columns.

The data is space-delimited; the issue is that when not all columns are enabled, the log includes only the columns that are enabled. Hive can't automatically insert nulls, since the data gives no indication of which columns are missing. I need to be able to map the available columns in the log to the full schema.

Example good log:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 GET /common/viewFile/1232 Mozilla/5.0+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/27.0.1453.116+Safari/537.36

Example log with missing columns (cs-method and useragent):

#Fields: date time s-ip cs-uri-stem 
2013-07-16 00:00:00 10.1.15.8 /common/viewFile/1232

The log with missing columns needs to be mapped to the full schema like this:

#Fields: date time s-ip cs-method cs-uri-stem useragent
2013-07-16 00:00:00 10.1.15.8 null /common/viewFile/1232 null

How can I map these enabled fields to a schema that includes all possible columns, inserting a blank/null/- token for the fields that are missing? Is this something I could handle with a Pig script?

2.

How can I define my Hive tables to include information from the HDFS path, namely Org and SubOrg in my directory structure example, so that it is queryable in Hive? I am also unsure how to properly import data from the many directories into a single Hive table.

2 Answers

Aperson · 2019-08-22 05:32

I was able to solve both of my issues with a Pig UDF (user-defined function).

  1. Mapping columns to the proper schema: see this answer and this one.

All I really had to do was add some logic to handle the IIS headers that start with #. Below is the snippet from getNext() that I used; everything else is the same as mr2ert's example code.

See the values[0].equals("#Fields:") parts.

        @Override
        public Tuple getNext() throws IOException {
            ...

            Tuple t = mTupleFactory.newTuple(1);

            // ignore header lines except the field definitions
            if (values[0].startsWith("#") && !values[0].equals("#Fields:")) {
                return t;
            }
            ArrayList<String> tf = new ArrayList<String>();
            int pos = 0;

            for (int i = 0; i < values.length; i++) {
                if (fieldHeaders == null || values[0].equals("#Fields:")) {
                    // grab field headers, ignoring the #Fields: token at values[0]
                    if (i > 0) {
                        tf.add(values[i]);
                    }
                    fieldHeaders = tf;
                } else {
                    // data line: read each value into the current field position
                    readField(values[i], pos);
                    pos = pos + 1;
                }
            }
            ...
        }
  2. To include information from the file path, I added the following to the same LoadFunc UDF I used to solve 1. In the prepareToRead override, grab the file path and store it in a member variable.

    public class IISLoader extends LoadFunc {
        ...
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) {
            in = reader;
            // remember which file this split came from so getNext() can emit it
            filePath = ((FileSplit)split.getWrappedSplit()).getPath().toString();
        }
        ...
    }

Then within getNext() I could add the path to the output tuple.

干净又极端 · 2019-08-22 05:39

First, please provide sample data for better help.

How can I map these enabled fields to a schema that includes all possible columns, inserting a blank/null/- token for the fields that are missing?

If the file has a delimiter you can use Hive, and Hive automatically inserts nulls for columns that are missing at the end of a row, provided that the delimiter does not also appear as part of your data.

Is this something I could handle with a Pig script?

If there is a delimiter between the fields then you can use Hive; otherwise you can go with MapReduce or Pig.

How can I include information from the HDFS path, namely Org and SubOrg in my directory structure example, so that it is queryable in Hive?

It seems you are new to Hive. Before querying, you have to create a table that includes information such as the path, the delimiter, and the schema.
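
As a minimal sketch of what that could look like (table and column names here are illustrative, not from the original post; date and time are renamed to avoid Hive keywords), an external table can point directly at one of the existing log directories, using the six-column layout and space delimiter from the question:

    CREATE EXTERNAL TABLE iis_log (
        log_date STRING,
        log_time STRING,
        s_ip STRING,
        cs_method STRING,
        cs_uri_stem STRING,
        useragent STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE
    LOCATION '/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/';

Keep in mind that Hive fills NULL only for columns missing at the end of a row; a column missing from the middle (like cs-method in the question's second example) shifts the remaining values into the wrong columns, which is why the Pig preprocessing from the other answer may still be needed.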

Is this a good candidate for partitioning?

You can partition on date if you wish.
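
One way the Org and SubOrg path information could also become queryable, again only a sketch with illustrative names: declare them as partition columns on an external table and register each existing HDFS directory as a partition:

    CREATE EXTERNAL TABLE iis_log_part (
        log_date STRING,
        log_time STRING,
        s_ip STRING,
        cs_method STRING,
        cs_uri_stem STRING,
        useragent STRING
    )
    PARTITIONED BY (org STRING, suborg STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ' '
    STORED AS TEXTFILE;

    -- map one of the existing directories to a partition
    ALTER TABLE iis_log_part
    ADD PARTITION (org='Org', suborg='SubOrg')
    LOCATION '/user/oi/raw_iis/Webserver1/Org/SubOrg/W3SVC1056242793/';

Partitioning on date is also possible, but because the raw directories are not organized by date it would mean rewriting the data, for example with a dynamic-partition INSERT from a staging table.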
