MapReduce XML input format - to build custom format

Posted 2019-08-14 22:12

If the input files are in XML format, I shouldn't use TextInputFormat, because TextInputFormat assumes that each line of the input file is one record, and the Mapper is called once per line with a key-value pair for that line.
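To illustrate the concern, here is a minimal sketch in plain Java with no Hadoop dependency. It only mimics the (byte offset, line) pairs that a line-oriented input format would hand to a mapper; the class and method names (`LineRecordDemo`, `lineRecords`) are made up for this example and are not part of any Hadoop API.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class LineRecordDemo {
    // Hypothetical helper: maps the byte offset of each line to the line text,
    // imitating one record per line.
    static Map<Long, String> lineRecords(String input) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : input.split("\n")) {
            records.put(offset, line);
            // +1 accounts for the newline that split() removed
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<book>\n  <title>Hadoop</title>\n</book>\n";
        // One logical <book> element arrives as three unrelated records.
        lineRecords(xml).forEach((offset, line) ->
                System.out.println(offset + " -> " + line));
    }
}
```

Each mapper call would see only one of these fragments, so a single `<book>` element that spans several lines is never available as a whole.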

So I think we need a custom input format to scan the XML datasets.

I am new to Hadoop MapReduce. Is there an article, link, or video that shows the steps to build a custom input format?

Thanks, nath

Tags: hadoop xml-parsing mapreduce

1 answer

【Aperson】

#2 · 2019-08-14 22:47

Problem: Working on a single XML file in parallel in MapReduce is tricky because XML does not contain a synchronization marker in its data format. So how do we work with a file format like XML that is not inherently splittable?
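A small sketch of why this matters, in plain Java with no Hadoop dependency: if a file is divided purely by byte count, the boundaries land in the middle of elements. The names `NaiveSplitDemo` and `splitBySize` and the 16-byte split size are assumptions chosen for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class NaiveSplitDemo {
    // Hypothetical helper: cuts the input into fixed-size byte ranges,
    // ignoring the XML structure entirely.
    static List<String> splitBySize(String xml, int splitSize) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < bytes.length; start += splitSize) {
            int end = Math.min(start + splitSize, bytes.length);
            chunks.add(new String(bytes, start, end - start, StandardCharsets.UTF_8));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String xml = "<books><book>A</book><book>B</book></books>";
        for (String chunk : splitBySize(xml, 16)) {
            System.out.println("[" + chunk + "]");
        }
        // prints:
        // [<books><book>A</]
        // [book><book>B</bo]
        // [ok></books>]
    }
}
```

None of the three chunks is well-formed XML on its own, so a reader that starts at a split boundary needs some way to find where the next complete record begins.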

Solution: MapReduce doesn’t contain built-in support for XML, so we have to turn to another Apache project, Mahout, a machine learning system, which provides an XML InputFormat.

So there is no need to write a custom input format, since the Mahout library already provides one. I am not sure whether you are going to read or write XML, but both are described in the link above.

Please have a look at the XmlInputFormat implementation details here.

Furthermore, XmlInputFormat extends TextInputFormat.
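To give a feel for how an XML input format can cope with splits, here is a simplified sketch in plain Java of start-tag/end-tag scanning. It is not Mahout's actual code and uses no Hadoop classes; `XmlRecordScanner`, `readSplit`, and `indexOf` are hypothetical names. The assumed rule is that a split owns every record whose start tag begins inside its byte range, and the reader may run past the end of the split to finish that record. The sketch matches tags literally, so it does not handle attributes on the start tag or nested elements with the same name.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class XmlRecordScanner {
    // Index of the first occurrence of pattern in data at or after from, or -1.
    static int indexOf(byte[] data, byte[] pattern, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - pattern.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    // Returns every record whose start tag begins inside [splitStart, splitEnd).
    // A record may extend beyond splitEnd; scanning continues until its end tag.
    static List<String> readSplit(byte[] data, int splitStart, int splitEnd,
                                  String startTag, String endTag) {
        byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
        byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            int s = indexOf(data, start, pos);
            if (s < 0 || s >= splitEnd) break;   // next record belongs to a later split
            int e = indexOf(data, end, s + start.length);
            if (e < 0) break;                    // unterminated record, stop
            int recordEnd = e + end.length;
            records.add(new String(data, s, recordEnd - s, StandardCharsets.UTF_8));
            pos = recordEnd;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "<books><book>A</book><book>B</book></books>"
                .getBytes(StandardCharsets.UTF_8);
        // Three byte ranges standing in for three input splits.
        System.out.println(readSplit(data, 0, 16, "<book>", "</book>"));
        System.out.println(readSplit(data, 16, 32, "<book>", "</book>"));
        System.out.println(readSplit(data, 32, data.length, "<book>", "</book>"));
    }
}
```

With this rule the start tag acts as the synchronization marker that XML itself lacks: each record is read by exactly one split, even when it straddles a boundary.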
