I have created a Hive external table that reads a custom file input format. This works perfectly fine when the files are small, but when the files are big the job splits them up and my job fails.
I'm returning false from the isSplitable method in my custom input format class. I have also tried setting mapreduce.input.fileinputformat.split.minsize and mapred.min.split.size to large values. I created a custom InputFormat, OutputFormat, and SerDe class and used them when creating this table.
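For context, the relevant part of my input format looks roughly like this (a simplified sketch; my real record reader logic is omitted, and Hive's STORED AS INPUTFORMAT uses the old mapred API):

package com.hiveio.io;

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomHiveInputFormat extends FileInputFormat<Text, Text> {

    // Hadoop spells the hook isSplitable (one 't'); returning false here
    // should force one split per file.
    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        // My real class returns a custom RecordReader for the file format; omitted here.
        throw new UnsupportedOperationException("record reader omitted in this sketch");
    }
}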
Despite this, I'm still seeing splits happening in my job logs:
Processing split: Paths:/user/test/testfile1:0+134217728,/user/test/testfile1:134217728+95198924,/user/test/testfile2:0+134217728,/user/test/testfile2:134217728+96092244...
The 134217728 is 128 MB, which must be my HDFS block size. Is there a way I can prevent this splitting from happening? Is it related to this issue: https://issues.apache.org/jira/browse/HIVE-8630 ?
My CREATE TABLE statement is:
CREATE EXTERNAL TABLE test_data(
key STRING,
body map<string, string>
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'com.hiveio.io.CustomHiveSerde'
STORED AS INPUTFORMAT 'com.hiveio.io.CustomHiveInputFormat'
OUTPUTFORMAT 'com.hiveio.io.CustomHiveOutputFormat'
LOCATION '/user/test/';