Running a standalone Hadoop application on multipl

My team built a Java application using the Hadoop libraries to transform a bunch of input files into useful output. Given the current load, a single multicore server will do fine for the coming year or so. We do not (yet) need a multi-server Hadoop cluster, but we chose to start this project "being prepared".

When I run this app on the command line (or in Eclipse or NetBeans) I have not yet been able to convince it to use more than one map and/or reduce thread at a time. Since the tool is very CPU intensive, this "single threadedness" is my current bottleneck.

When running it in the NetBeans profiler I do see that the app starts several threads for various purposes, but only a single map or reduce task is running at any given moment.

The input data consists of several input files so Hadoop should at least be able to run 1 thread per input file at the same time for the map phase.
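To make the expectation above concrete, here is a minimal plain-Java sketch of "one worker per input file, capped at a fixed number of threads". It only illustrates the idea with `java.util.concurrent`; it is not how Hadoop schedules map tasks, and the class name, method name, and file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Plain-Java illustration only; NOT Hadoop's own task scheduling. */
final class ParallelMapSketch {

    /** Runs mapTask once per input on at most maxThreads workers; results keep input order. */
    static <I, O> List<O> mapInParallel(List<I> inputs, Function<I, O> mapTask, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        try {
            // Submit one task per input (think: one map task per input file).
            List<Future<O>> pending = new ArrayList<>();
            for (I input : inputs) {
                Callable<O> task = () -> mapTask.apply(input);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order, blocking until each is done.
            List<O> results = new ArrayList<>();
            for (Future<O> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for map tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a map task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical input file names; the "map" step here just measures the name length.
        List<String> files = Arrays.asList("part-a.txt", "part-b.txt", "part-c.txt");
        int threads = Math.min(files.size(), Runtime.getRuntime().availableProcessors());
        System.out.println(mapInParallel(files, String::length, threads));
    }
}
```

Capping the pool at the smaller of the file count and the core count mirrors the "2 or even 4 active threads" goal without oversubscribing the CPU.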

What do I do to at least have 2 or even 4 active threads running (which should be possible for most of the processing time of this application)?

I'm expecting this to be something very silly that I've overlooked.

I just found this: https://issues.apache.org/jira/browse/MAPREDUCE-1367. It implements the feature I was looking for in Hadoop 0.21 and introduces the flag mapreduce.local.map.tasks.maximum to control it.
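As a rough sketch of how that flag might be passed: if the driver is launched through Hadoop's ToolRunner (an assumption; the question does not say), properties can be supplied as `-D name=value` generic options. The helper below only assembles those arguments, sized to the machine's core count. `LocalModeArgs` and `forMapThreads` are hypothetical names, and the code deliberately avoids the Hadoop classes so it runs on its own.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that builds generic-option arguments for local-mode parallelism. */
final class LocalModeArgs {

    // Property name as quoted from MAPREDUCE-1367 (Hadoop 0.21 and later).
    static final String LOCAL_MAX_MAPS = "mapreduce.local.map.tasks.maximum";

    /** Returns the two arguments "-D" and "name=value" for the given map-thread limit. */
    static List<String> forMapThreads(int maxMaps) {
        if (maxMaps < 1) {
            throw new IllegalArgumentException("maxMaps must be at least 1, got " + maxMaps);
        }
        return Arrays.asList("-D", LOCAL_MAX_MAPS + "=" + maxMaps);
    }

    public static void main(String[] args) {
        // Use one local map thread per available core.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(String.join(" ", forMapThreads(cores)));
    }
}
```

Inside the driver itself, the same value could instead be set on the job's Configuration object with setInt before the job is submitted.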

For now I've also found the solution described here in this question.

Tags: java multithreading command-line hadoop mapreduce

4 answers


#2 · 2019-04-07 04:54

According to this thread on the Hadoop core-user mailing list, you'll want to change the mapred.tasktracker.tasks.maximum setting to the maximum number of tasks you would like your machine to handle (which would be the number of cores).

This (and other properties you may want to configure) is also documented in the main documentation on how to set up your cluster/daemons.


放荡不羁爱自由

#3 · 2019-04-07 04:55

I'm not sure if I'm correct, but when you are running tasks in local mode, you can't have multiple mappers/reducers.

Anyway, to set the maximum number of running mappers and reducers, use the configuration options mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. By default those options are set to 2, so I might be right.

Finally, if you want to be prepared for a multi-node cluster, go straight to running this in a fully distributed way, but have all servers (namenode, datanode, tasktracker, jobtracker, ...) run on a single machine.


Lonely孤独者°

#4 · 2019-04-07 05:16

What you want to do is run Hadoop in "pseudo-distributed" mode: one machine, but running task trackers and name nodes as if it were a real cluster. Then it will (potentially) run several workers.

Note that if your input is small Hadoop will decide it's not worth parallelizing. You may have to coax it by changing its default split size.
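To see why split size matters, here is a simplified model of how the number of map tasks for one file could be estimated. It is modeled on my understanding of FileInputFormat in the newer org.apache.hadoop.mapreduce API (split size is the block size clamped between a minimum and a maximum, with a 10% slop on the last split); treat the formula and the constant as assumptions to check against your Hadoop version. The class and method names are hypothetical, and non-splittable inputs such as gzip files are not covered.

```java
/** Simplified, assumption-based model of per-file split counting; not Hadoop code. */
final class SplitEstimator {

    // Assumed slop factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** Block size clamped to the configured minimum and maximum split sizes. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Estimated number of splits (and therefore map tasks) for one splittable file. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        // Carve off full-size splits while clearly more than one split's worth remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left (if anything) becomes the final split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long defaultSize = splitSize(128 * mb, 1, Long.MAX_VALUE);
        long smallerSize = splitSize(128 * mb, 1, 32 * mb);
        System.out.println("200 MB file, default split size: " + countSplits(200 * mb, defaultSize));
        System.out.println("200 MB file, 32 MB max split:    " + countSplits(200 * mb, smallerSize));
    }
}
```

Under this model a 10 MB file with a 128 MB block size yields a single split, which is why small inputs may need a lower maximum split size before more than one mapper per file is started.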

In my experience, "typical" Hadoop jobs are I/O bound, sometimes memory-bound, way before they are CPU-bound. You may find it impossible to fully utilize all the cores on one machine for this reason.


迷人小祖宗

#5 · 2019-04-07 05:19

Just for clarification: if Hadoop runs in local mode, you don't get parallel execution at the task level (unless you're running Hadoop 0.21 or later, see MAPREDUCE-1367). You can, however, submit multiple jobs at once, and those will then be executed in parallel.

All those

mapred.tasktracker.{map|reduce}.tasks.maximum

properties apply only to Hadoop running in distributed mode!

HTH Joahnnes
