Explanation for Hadoop Mapreduce Console Output

Published 2019-02-05 09:57

Question:

I am a newbie in the Hadoop environment. I have already set up a 2-node Hadoop cluster and run a sample MapReduce application (wordcount, actually). I got output like this:

File System Counters
    FILE: Number of bytes read=492
    FILE: Number of bytes written=6463014
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=71012
    HDFS: Number of bytes written=195
    HDFS: Number of read operations=404
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters 
    Launched map tasks=80
    Launched reduce tasks=1
    Data-local map tasks=80
    Total time spent by all maps in occupied slots (ms)=429151
    Total time spent by all reduces in occupied slots (ms)=72374
Map-Reduce Framework
    Map input records=80
    Map output records=8
    Map output bytes=470
    Map output materialized bytes=966
    Input split bytes=11040
    Combine input records=0
    Combine output records=0
    Reduce input groups=1
    Reduce shuffle bytes=966
    Reduce input records=8
    Reduce output records=5
    Spilled Records=16
    Shuffled Maps =80
    Failed Shuffles=0
    Merged Map outputs=80
    GC time elapsed (ms)=5033
    CPU time spent (ms)=59310
    Physical memory (bytes) snapshot=18515763200
    Virtual memory (bytes) snapshot=169808543744
    Total committed heap usage (bytes)=14363394048
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters 
    Bytes Read=29603
File Output Format Counters 
    Bytes Written=195

Is there an explanation for each piece of data I got? Especially:

  1. Total time spent by all maps in occupied slots (ms)
  2. Total time spent by all reduces in occupied slots (ms)
  3. CPU time spent (ms)
  4. Physical memory (bytes) snapshot
  5. Virtual memory (bytes) snapshot
  6. Total committed heap usage (bytes)

Answer 1:

The MapReduce framework maintains counters while a job runs. These counters are shown to the user to help understand job statistics and to support benchmarking and performance analysis. Your job output shows some of these counters. There is a good explanation of counters in chapter 8 of Hadoop: The Definitive Guide; I suggest you check it.

To explain about the items you asked for,

1) Total time spent by all maps - The total time taken running map tasks, in milliseconds. Includes tasks that were started speculatively (speculative execution launches a duplicate attempt of a slow-running task on another node and keeps whichever attempt finishes first; in layman's terms, a speculative attempt is a re-run of a particular map task in parallel).
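
As an aside, speculative execution can be toggled per job. This is a minimal sketch assuming the Hadoop 2.x mapreduce API; the mapreduce.map.speculative and mapreduce.reduce.speculative properties are standard, while the class and job name here are just illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationConfig {
        // Illustrative helper: build a job with speculative execution disabled.
        public static Job configure() throws Exception {
            Configuration conf = new Configuration();
            // Disable duplicate attempts for slow map tasks
            conf.setBoolean("mapreduce.map.speculative", false);
            // Disable duplicate attempts for slow reduce tasks
            conf.setBoolean("mapreduce.reduce.speculative", false);
            return Job.getInstance(conf, "wordcount");
        }
    }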

2) Total time spent by all reduces - The total time taken running reduce tasks in milliseconds.

3) CPU Time - The cumulative CPU time for a task in milliseconds

4) Physical memory - The physical memory being used by a task, in bytes; this includes the memory used for spills as well.

5) Total virtual memory - The virtual memory being used by a task in bytes

6) Total committed heap usage - The total amount of memory available in the JVM in bytes
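
If you want to read these exact figures programmatically rather than from the console, something like the following sketch works against the Hadoop 2.x mapreduce API. The JobCounter and TaskCounter enum constants below correspond to the lines in your output (verify them for your Hadoop version); the printCounters helper is just illustrative:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class CounterReport {
        // Illustrative helper: call after job.waitForCompletion(true).
        public static void printCounters(Job job) throws Exception {
            Counters counters = job.getCounters();
            // 1) and 2): job-level slot-time counters
            System.out.println("Map slot millis    = "
                    + counters.findCounter(JobCounter.SLOTS_MILLIS_MAPS).getValue());
            System.out.println("Reduce slot millis = "
                    + counters.findCounter(JobCounter.SLOTS_MILLIS_REDUCES).getValue());
            // 3) to 6): task-level counters, aggregated over all tasks
            System.out.println("CPU time (ms)      = "
                    + counters.findCounter(TaskCounter.CPU_MILLISECONDS).getValue());
            System.out.println("Physical mem (B)   = "
                    + counters.findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).getValue());
            System.out.println("Virtual mem (B)    = "
                    + counters.findCounter(TaskCounter.VIRTUAL_MEMORY_BYTES).getValue());
            System.out.println("Committed heap (B) = "
                    + counters.findCounter(TaskCounter.COMMITTED_HEAP_BYTES).getValue());
        }
    }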

Hope this helps. The categories of counters and their details are neatly given in the Definitive Guide; if you need any additional info, please let me know.

Thank You.

Extra details added after a comment:

RAM is the primary memory used when processing a job: the data is brought into RAM and the job is processed while it stays there. But the data might be bigger than the RAM available. In such scenarios, the operating system keeps part of the data on disk and swaps it to and from RAM, so that even a smaller RAM is sufficient for files that are larger than memory. For example: if RAM is 64 MB and the file size is 128 MB, then 64 MB is kept in RAM first and the other 64 MB on disk, and the OS swaps between them. It won't actually keep it as two fixed 64 MB halves; internally it divides the data into segments/pages.

I just gave an example for understanding. Virtual memory is the mechanism that lets you work with data bigger than RAM by paging and swapping between disk and RAM. So in the case above, the 64 MB on disk is virtually being used as RAM, which is why it is called virtual memory.

Hope you understand. If you are satisfied with the answer, please accept it. Let me know if you have any questions.

Heap is the JVM memory used for object storage, which is set using JVM options (such as -Xmx) on the command line. Normally all Java programs need to have these settings.
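
To see the heap figures on your own machine, here is a small standalone probe (my addition, not from the Hadoop API): the "committed" value it prints is the same quantity that the "Total committed heap usage (bytes)" counter aggregates across tasks:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class HeapProbe {
        public static void main(String[] args) {
            // Heap usage as seen by the JVM's own management interface
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.println("used      (B) = " + heap.getUsed());
            System.out.println("committed (B) = " + heap.getCommitted()); // what the counter reports
            System.out.println("max       (B) = " + heap.getMax());       // bounded by -Xmx
        }
    }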