MapReduce XML input format - to build custom format

Posted 2019-08-14 22:12

If the input files are in XML format, I shouldn't use TextInputFormat, because TextInputFormat assumes each line of the input file is one record, and the Mapper is called once per line to get a key-value pair for that record.

So I think we need a custom input format to scan the XML datasets.

I'm new to Hadoop MapReduce. Is there any article, link, or video that shows the steps to build a custom input format?

thanks nath

1 answer
【Aperson】
#2 · 2019-08-14 22:47

Problem: Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. Therefore, how do we work with a file format that’s not inherently splittable like XML?

Solution: MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.

So there is no need to write a custom input format, since the Mahout library already provides one. I am not sure whether you are going to read or write XML, but both are described in the link above.

Please have a look at the XmlInputFormat implementation details here.

Furthermore, XmlInputFormat extends TextInputFormat.
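
To make the idea more concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of the tag-scanning approach an XML input format of this kind relies on: since XML has no synchronization marker, the reader treats a configured start tag and end tag as record boundaries. A record belongs to the split in which its start tag begins, and the reader may read past the end of the split to finish that record. The names `XmlRecordScanner`, `recordsForSplit`, and the `<record>` tag are hypothetical and only for illustration; this is not Mahout's actual code, which works on byte streams from HDFS. In Mahout's XmlInputFormat the two tags are, as far as I know, supplied through the `xmlinput.start` and `xmlinput.end` job configuration properties.

```java
import java.util.ArrayList;
import java.util.List;

public class XmlRecordScanner {

    /**
     * Returns every record whose start tag begins inside [splitStart, splitEnd).
     * A record may extend past splitEnd; it is still owned by this split.
     * Offsets are character offsets into the string (an assumption of this sketch).
     */
    static List<String> recordsForSplit(String data, int splitStart, int splitEnd,
                                        String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int start = data.indexOf(startTag, pos);
            // Stop when there is no further start tag, or it belongs to a later split.
            if (start < 0 || start >= splitEnd) {
                break;
            }
            int end = data.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // incomplete record at end of input: skip it
            }
            end += endTag.length();
            records.add(data.substring(start, end));
            pos = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<root><record>a</record><record>b</record><record>c</record></root>";
        // Two artificial splits, cut in the middle of the second record.
        System.out.println(recordsForSplit(xml, 0, 30, "<record>", "</record>"));
        System.out.println(recordsForSplit(xml, 30, xml.length(), "<record>", "</record>"));
    }
}
```

Each record is emitted by exactly one split, even though the split boundary falls in the middle of the second record. Note that this simple scan assumes the record element is not nested inside itself and that the start tag carries no attributes.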
