In the new API (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (the default) to separate the key and the value?
Sample Input :
one,first line
two,second line
Output Required :
Key : one
Value : first line
Key : two
Value : second line
I am specifying KeyValueTextInputFormat as :
Job job = new Job(conf, "Sample");
job.setInputFormatClass(KeyValueTextInputFormat.class);
KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt"));
This is working fine for tab as a separator.
Example
First, the new API was not finished in 0.20.*, so if you want to use the new API in 0.20.* you have to implement this feature yourself. For example, you can use a FileInputFormat subclass such as TextInputFormat: ignore the LongWritable key and split the Text value on the comma yourself.
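A minimal sketch of that manual split, written as plain Java so it runs without a cluster. The class name CommaLineSplitter and the method split are hypothetical names made up for this illustration. It cuts each line at the first comma only, and treats a line with no comma as a key with an empty value; that fallback is a choice made for this sketch, not something the question specifies.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class CommaLineSplitter {

    // Split at the FIRST comma only, so commas inside the value are kept.
    // A line without any comma becomes the key, with an empty value.
    public static Map.Entry<String, String> split(String line) {
        int pos = line.indexOf(',');
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // The two sample lines from the question.
        String[] lines = {"one,first line", "two,second line"};
        for (String line : lines) {
            Map.Entry<String, String> kv = split(line);
            System.out.println("Key : " + kv.getKey());
            System.out.println("Value : " + kv.getValue());
        }
    }
}
```

Inside a Mapper<LongWritable, Text, Text, Text> you would call the same logic on value.toString() in map() and write the two halves out as the new key and value.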
By default, the KeyValueTextInputFormat class uses tab as the separator between key and value in each line of the input text file. If you want to read input that uses a custom separator, you have to set the separator property in the job's Configuration.
For the new Hadoop APIs, it is different:
For KeyValueTextInputFormat, each input line should be a key/value pair separated by "\t".
By changing the default separator, you will be able to read the input as you wish.
For the new API:
Here is the solution
Map
Output
It's a matter of ordering.
The line
conf.set("key.value.separator.in.input.line", ",")
must come before you create the Job instance, because Job takes its own copy of the Configuration when it is constructed and does not see later changes made to conf.
In the newer API you should use the
mapreduce.input.keyvaluelinerecordreader.key.value.separator
configuration property. Here's an example:
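The original example did not survive here, so what follows is a sketch of what such a driver could look like, not the answerer's code. It assumes Hadoop 2.x or later on the classpath (Job.getInstance is used in place of the question's new Job(conf, "Sample")). The class names CommaKeyValueJob and PassThroughMapper are hypothetical, and the input and output paths are taken from the command line. The point to notice is that the separator is set on conf before the Job is created.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommaKeyValueJob {

    // Hypothetical mapper: KeyValueTextInputFormat hands it Text keys and
    // Text values, which it writes out unchanged.
    public static class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set the separator BEFORE creating the Job: the Job copies conf.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

        Job job = Job.getInstance(conf, "Sample");
        job.setJarByClass(CommaKeyValueJob.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only, to keep the sketch short
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If the Job already exists, setting the property through job.getConfiguration().set(...) should also work, since that changes the job's own copy of the configuration.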