Reading an Excel file in Hadoop MapReduce

Posted 2019-08-03 01:04

Question:

I am trying to read an Excel file containing some data for aggregation in Hadoop. The MapReduce program seems to work fine, but the output it produces is not in a readable format. Do I need to use a special InputFormat for Excel files in Hadoop MapReduce? My configuration is below:

    Configuration conf = getConf();
    Job job = new Job(conf, "LatestWordCount");
    job.setJarByClass(FlightDetailsCount.class);

    // Input and output locations come from the command line.
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);
    FileInputFormat.setInputPaths(job, input);
    FileOutputFormat.setOutputPath(job, output);

    job.setMapperClass(MapClass.class);
    job.setReducerClass(ReduceClass.class);
    //job.setCombinerClass(ReduceClass.class);

    // The input is read, and the output written, as lines of text.
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    //job.setOutputKeyClass(Text.class);
    //job.setOutputValueClass(Text.class);

    // Return the status instead of calling System.exit(), which made the old "return 0" unreachable at runtime.
    return job.waitForCompletion(true) ? 0 : 1;

The output produced looks like this: �KW ��O�A��]n��Ε��r3�\n"���p�饚6W�jJ���9W�f=��9ml��dR�y/Ք��7�^�i ��M*Ք�^nz��l��^�)��妗j�(��dRͱ/7�TS*��M//7�TS��&�jZ��o��TSR�7�@�)�o��TӺ��5{%��+��ۆ�w6-��=�e�_}m�)~��ʅ��ژ���: #�j�]��u����>

Answer 1:

I don't know whether anyone has actually developed a custom InputFormat for MS Excel files (I doubt it, and a quick search turns up nothing), but you most certainly cannot read an Excel file using the TextInputFormat. XLS files are binary.
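To check whether a given input really is a binary workbook before sending it through a line-based reader, you can look at its first few bytes. The sketch below is plain Java with no Hadoop dependency. It relies on two well-known signatures: legacy .xls workbooks are OLE2 compound documents, and .xlsx workbooks are ZIP containers. The class name `SpreadsheetSniffer` and the returned labels are made up for this illustration, and a ZIP signature only shows that the file is some ZIP archive, not that it is definitely a workbook.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: guesses a file's container format from its leading bytes. */
final class SpreadsheetSniffer {
    // OLE2 compound document signature, used by legacy .xls workbooks.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature ("PK" 03 04), used by .xlsx workbooks.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Returns "OLE2", "ZIP", or "UNKNOWN" for the given leading bytes. */
    static String sniffBytes(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return "OLE2";
        if (startsWith(head, ZIP_MAGIC)) return "ZIP";
        return "UNKNOWN"; // may well be plain text such as CSV
    }

    /** Reads up to 8 bytes from the start of the file and classifies them. */
    static String sniffFile(Path file) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) break;
                n += r;
            }
        }
        return sniffBytes(Arrays.copyOf(head, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SpreadsheetSniffer <file>");
            return;
        }
        System.out.println(sniffFile(Paths.get(args[0])));
    }
}
```

If this reports OLE2 or ZIP for a file copied out of HDFS, TextInputFormat will split the raw bytes on newline characters and hand the mapper meaningless fragments, which would explain output like that shown in the question.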

Solution: Export your Excel file to CSV or TSV; you will then be able to load it using the TextInputFormat.
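Once the data is exported to CSV, TextInputFormat delivers one line per record, so the per-record work in the mapper is splitting that line into fields. Below is a minimal sketch of that step in plain Java, assuming the export quotes fields that contain commas and escapes a quote by doubling it. The class name `CsvLine` is hypothetical. Quoted fields that contain line breaks are not handled, because a line-based reader would already have cut such a record apart.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits one CSV line into its fields. */
final class CsvLine {
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);        // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }
}
```

Inside a mapper's `map` method this would be called as `CsvLine.split(value.toString())`, where `value` is the `Text` line supplied by TextInputFormat. For TSV exports without quoting, `line.split("\t", -1)` is usually enough.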



Answer 2:

I know this is a bit late, but someone has since created an Excel input format as a standard solution for this kind of problem. Read this: https://sreejithrpillai.wordpress.com/2014/11/06/excel-inputformat-for-hadoop-mapreduce/

There is also a GitHub project with the code base.

Look here: https://github.com/sreejithpillai/ExcelRecordReaderMapReduce/



Answer 3:

You can also use the HadoopOffice library, which allows you to read and write Excel files with Hadoop and Spark. It is available on Maven Central and Spark Packages.

https://github.com/ZuInnoTe/hadoopoffice/wiki