I have created a Hive external table that reads a custom file input format. This works perfectly when the files are small, but when the files are big the job splits them up and fails.
I return false from the isSplitable method in my custom input format class. I have also tried setting mapreduce.input.fileinputformat.split.minsize and mapred.min.split.size to large values. I created a custom InputFormat, OutputFormat, and SerDe class and used them when creating this table.
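For what it's worth, the relevant part of my input format looks roughly like this (a simplified sketch; the value type and the record reader are placeholders, and the real class implements getRecordReader for the custom file layout):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomHiveInputFormat extends FileInputFormat<Text, BytesWritable> {

    // Hive input formats use the old mapred API, so this is the split hook
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // one mapper per file, never split
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        // the real record reader for the custom file layout goes here;
        // omitted in this sketch
        throw new UnsupportedOperationException("omitted in this sketch");
    }
}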
In my job logs I can still see the splits happening:
Processing split: Paths:/user/test/testfile1:0+134217728,/user/test/testfile1:134217728+95198924,/user/test/testfile2:0+134217728,/user/test/testfile2:134217728+96092244...
134217728 is 128 MB, which must be my HDFS block size. Is there a way I can prevent this splitting? Is it related to https://issues.apache.org/jira/browse/HIVE-8630?
My Create table statement is:
CREATE EXTERNAL TABLE test_data(
key STRING,
body map<string, string>
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'com.hiveio.io.CustomHiveSerde'
STORED AS INPUTFORMAT 'com.hiveio.io.CustomHiveInputFormat'
OUTPUTFORMAT 'com.hiveio.io.CustomHiveOutputFormat'
LOCATION '/user/test/';
OK, actually, your mention of https://issues.apache.org/jira/browse/HIVE-8630 rang a bell; a while ago we dealt with a very similar problem. The bug describes how CombineHiveInputFormat will still split unsplittable formats. CombineHiveInputFormat is the default (it is the default value of hive.input.format), and its purpose is to combine multiple small files to reduce overhead. You can disable it by setting hive.input.format back to the plain HiveInputFormat:
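set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;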
before the query, or put it in hive-site.xml if you want it as the default:
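<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
</property>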
Note that you'll be sacrificing the Combine part of the feature: if you have many small files, each one will take its own mapper during processing. But this should work; it did work for us.