Hadoop and HBase rebalancing after node additions

Posted 2020-05-23 03:19

I have a fundamental question about load balancing. I just finished adding new nodes to our Hadoop (2.3) cluster, which also runs HBase v0.98. Now that the nodes have been added and are all online in both Hadoop and HBase:

  1. How is HBase affected by the Hadoop rebalancer? Do I need to explicitly rebalance HBase after the Hadoop rebalance?

  2. My Hadoop cluster is entirely occupied by HBase. If I set balance_switch to true, will it automatically rebalance both HBase and Hadoop?

  3. What is the best way to make sure that both Hadoop and HBase are rebalanced and still work properly?

Tags: hadoop hbase
2 answers
叛逆
Reply #2 · 2020-05-23 03:42

Hadoop does not do block-level balancing by default. You can balance HDFS manually with the balancer tool (https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/CommandsManual.html#balancer). Note that balancing HDFS is quite expensive when you have just added a small number of completely empty nodes to an otherwise full cluster, and in my experience it does only an adequate job of balancing the HDFS blocks. Running the balancer multiple times can improve the overall balance. There are also alternative implementations that can do a better job of balancing than the one built into Hadoop.

You can inspect the balance of blocks from the HDFS NameNode UI by clicking the "Live Nodes" link. The "Block Pool Used" column is the useful one for this purpose. If the percentage used varies widely across machines, you may need to rebalance your HDFS cluster.
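To make "high variance" concrete, here is a minimal Python sketch of the kind of check you could do by hand with the numbers from that column. The node names and percentages are made up, and the function name is hypothetical. It assumes all DataNodes have the same capacity, so the plain average of the percentages stands in for the cluster-wide utilization, and it uses 10 percentage points as the cutoff, which is the HDFS balancer's default threshold.

```python
from statistics import mean

def nodes_outside_threshold(used_pct, threshold=10.0):
    """Return nodes whose utilization deviates from the average by more
    than `threshold` percentage points, mapped to their deviation."""
    avg = mean(used_pct.values())
    return {
        node: round(pct - avg, 1)
        for node, pct in used_pct.items()
        if abs(pct - avg) > threshold
    }

# Hypothetical "Block Pool Used" percentages: three full nodes, one new node.
used = {"dn1": 82.0, "dn2": 79.0, "dn3": 80.0, "dn4": 3.0}
outliers = nodes_outside_threshold(used)
print(outliers)
```

An empty result means every node is within the threshold of the average, which is roughly the state the balancer aims for. Anything else lists the nodes that are over (positive) or under (negative) the average.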

The balance_switch setting only affects regionserver balance: it turns HBase's region balancer on or off. HBase balances the regions in your cluster automatically by default, but you can also run the balancer manually at any time from the hbase shell.

You can inspect the region balance on the main page of the HBase master UI: in the "Region Servers" section, the "Load" column contains a value named "numberOfOnlineRegions". In general, HBase does a pretty good job of keeping this balanced. Only a few times, right after creating tables, have I seen the default balancing algorithm come up with a skewed set of regions. Regardless, the region balancer is fairly cheap and runs quickly. Running it once is usually enough to get you into a well-balanced state.
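If you want a quick yardstick for those "numberOfOnlineRegions" values, something like the following Python sketch works. The server names and counts are invented, and the function name is hypothetical. Treating "between the floor and ceiling of the average" as balanced is my own rule of thumb here, not a description of HBase's balancing algorithm.

```python
import math

def region_skew(regions_per_server):
    """Return the servers whose region count falls outside the range
    [floor(average), ceil(average)], mapped to their current count."""
    total = sum(regions_per_server.values())
    n = len(regions_per_server)
    low, high = total // n, math.ceil(total / n)
    return {
        server: count
        for server, count in regions_per_server.items()
        if count < low or count > high
    }

# Hypothetical counts right after adding an empty regionserver "rs4".
before = {"rs1": 40, "rs2": 41, "rs3": 39, "rs4": 0}
# Hypothetical counts after the region balancer has run.
after = {"rs1": 30, "rs2": 30, "rs3": 30, "rs4": 30}

print(region_skew(before))
print(region_skew(after))
```

An empty dictionary means the counts are as even as they can be for that number of regions and servers.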

我只想做你的唯一
Reply #3 · 2020-05-23 03:50
  1. The Hadoop (HDFS) balancer moves blocks from one node to another so that each datanode holds the same amount of data (within a configurable threshold). This disrupts HBase's data locality, meaning that a particular region may be serving a file that is no longer on its local host.

  2. HBase's balance_switch enables the region balancer, which balances the cluster so that each regionserver hosts the same number of regions (or close to it). This is separate from Hadoop's (HDFS) balancer.

  3. If you are running only HBase, I recommend not running Hadoop's (HDFS) balancer, as it will cause certain regions to lose their data locality. Any request to such a region then has to go over the network to one of the datanodes that holds its HFile.

HBase's data locality does recover, though. Whenever compaction occurs, the blocks are copied to the regionserver serving that region and merged into files written locally. At that point, data locality is recovered for that region. So all you really need to do to add new nodes to the cluster is add them. HBase will take care of rebalancing the regions, and once those regions compact, data locality will be restored.
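To illustrate what "data locality" means in the points above, here is a small Python sketch. It measures locality as the fraction of a region's HDFS blocks that have a replica on the regionserver's own host. The host names and replica placements are invented, the function name is hypothetical, and it assumes each regionserver runs on the same machine as a datanode.

```python
def locality(region_host, block_replicas):
    """Fraction of blocks that have at least one replica on region_host.
    `block_replicas` is a list of sets, one set of hosts per block."""
    if not block_replicas:
        return 1.0
    local = sum(1 for hosts in block_replicas if region_host in hosts)
    return local / len(block_replicas)

# Hypothetical region served by "rs1"; every block has a local replica.
before = [{"rs1", "rs2", "rs3"}, {"rs1", "rs3", "rs4"},
          {"rs1", "rs2", "rs4"}, {"rs1", "rs2", "rs3"}]
# The HDFS balancer moves two of the local replicas to a new node "rs5".
after_balancer = [{"rs5", "rs2", "rs3"}, {"rs1", "rs3", "rs4"},
                  {"rs5", "rs2", "rs4"}, {"rs1", "rs2", "rs3"}]
# Compaction rewrites the files from rs1, so the new blocks are local again.
after_compaction = [{"rs1", "rs2", "rs5"}, {"rs1", "rs3", "rs4"}]

for label, blocks in [("before", before),
                      ("after balancer", after_balancer),
                      ("after compaction", after_compaction)]:
    print(label, locality("rs1", blocks))
```

In this made-up layout, locality drops when the balancer moves replicas away from rs1 and returns to 1.0 once the rewritten files land on rs1 again.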
