
What happens if the leader is not dead but unable to receive messages?

Posted 2020-07-25 09:03

Question:

I have 3 brokers and 3 partitions. Each broker is the leader for one partition and is in the ISR for all of them. Let us say I have run the brokers on ports 19092, 29092 and 39092 respectively.

19092 - partition 0
29092 - partition 1
39092 - partition 2
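For reference, a topic laid out like this could be created with the kafka-topics tool. A minimal sketch, assuming ZooKeeper runs on localhost:2181 (newer Kafka versions take --bootstrap-server instead of --zookeeper):

bin/kafka-topics --create --zookeeper localhost:2181 --topic ftest --partitions 3 --replication-factor 3

Kafka then spreads the three partition leaders across the three brokers, giving the layout above.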

The Half-broker test:

I would like to name it this way because it allows only OUTPUT but not INPUT.

Now, I have added the following iptables rule:

iptables -A INPUT -p tcp --dport 29092 -j DROP
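To check that the rule is active, and to remove it once the test is done, the usual iptables commands apply:

iptables -L INPUT -n --line-numbers
iptables -D INPUT -p tcp --dport 29092 -j DROP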

and on the producer side:

bin/kafka-console-producer --broker-list 10.54.8.172:19092 --topic ftest

The above iptables rule blocks INPUT access but does not stop the broker from updating its liveness with ZooKeeper. So ZooKeeper does not consider it dead and does not trigger a leader election for partition 1.

But the producer is not able to connect to it because of the rule and hence throws an error:

org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for ftest-1: 1778 ms has passed since batch creation plus linger time
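The record expiry in this message is governed by the producer's timeout settings (request.timeout.ms in older clients, delivery.timeout.ms in newer ones). As a sketch, these can be raised on the console producer via --producer-property, though that only delays the error rather than making partition 1 reachable:

bin/kafka-console-producer --broker-list 10.54.8.172:19092 --topic ftest --producer-property request.timeout.ms=30000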

I have done this manually, but there can be other reasons why INPUT access may be blocked (malware, a DDoS attack, or anything else).

Before the iptables rule:

Metadata for ftest (from broker 1: 10.54.8.172:19092/1):

 3 brokers:

  broker 2 at 10.54.8.172:29092

  broker 1 at 10.54.8.172:19092

  broker 3 at 10.54.8.172:39092

 1 topics:

  topic "ftest" with 3 partitions:

    partition 2, leader 3, replicas: 3,1,2, isrs: 3,1,2

    partition 1, leader 2, replicas: 2,3,1, isrs: 2,3,1

    partition 0, leader 1, replicas: 1,2,3, isrs: 1,2,3

After the iptables rule:

Metadata for ftest (from broker 1: 10.54.8.172:19092/1):

 3 brokers:

  broker 2 at 10.54.8.172:29092

  broker 1 at 10.54.8.172:19092

  broker 3 at 10.54.8.172:39092

 1 topics:

  topic "ftest" with 3 partitions:

    partition 2, leader 3, replicas: 3,1,2, isrs: 3,1,2

    partition 1, leader 2, replicas: 2,3,1, isrs: 2

    partition 0, leader 1, replicas: 1,2,3, isrs: 1,2,3
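The listings above look like kafkacat metadata output; for reference, such a listing can be reproduced with something like:

kafkacat -L -b 10.54.8.172:19092 -t ftest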

Since there is only one leader and it is dead (in the sense that it cannot receive any messages), isn't this a single point of failure?

I think there must ideally be two-way communication between ZooKeeper and the Kafka brokers. Isn't that so? Does Kafka allow it? If so, how?

Also, when 29092 is blocked for INPUT access, its ISR shrank to 1.

It could be because it is not able to receive any messages (heartbeats) from the other two brokers.

If it can connect (OUTPUT is enabled), then it can write to them, but for the replication to be acknowledged it needs INPUT access.

So both INPUT and OUTPUT access are needed here as well.

Broker 29092 is as good as nothing here, leaving the system in an unrecoverable state!

Answer 1:

Your question is probably best answered by understanding how Kafka leverages ZooKeeper primitives to maintain and organize cluster state.

In Kafka, leader election is orchestrated by one of the brokers, which acts as the controller. There is only one controller, and it is elected among the brokers using ZooKeeper.

Now, each broker registers itself as an "ephemeral node" in ZooKeeper. The broker that initiated the ZooKeeper session maintains its membership with periodic heartbeats (ticks in ZooKeeper terms). If a broker fails to tick within the timeout interval, ZooKeeper removes that node, and the Kafka controller, which has registered itself to be notified of that event (via ZooKeeper watches), gets notified. This triggers a new leader election if the failed broker was the leader for a partition. The controller handles the leader election and notifies all the brokers.
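You can observe this directly in ZooKeeper, for example with the zookeeper-shell tool shipped with Kafka (a sketch, assuming ZooKeeper runs on localhost:2181):

bin/zookeeper-shell localhost:2181
ls /brokers/ids
get /controller

Here /brokers/ids contains one ephemeral znode per live broker (e.g. [1, 2, 3]) and /controller records which broker id currently acts as the controller. In your test, broker 2's znode stays in /brokers/ids because its ZooKeeper session remains alive.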

So yes, there is two-way communication between Kafka and ZooKeeper, but as far as partition leader election is concerned it is not direct two-way communication between every broker and ZooKeeper. There is a middleman in the form of the controller.

In your test, the controller never gets notified of a failure of broker 2, so that broker remains the leader of partition 1.

From here onwards, I am speculating:

Your broker 2, which has INPUT blocked, cannot receive metadata updates, so it fences itself by shrinking the ISR to just itself. This might help as well.
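For what it's worth, the ISR shrink itself is a leader-side mechanism: the leader drops followers that have not fetched from it within replica.lag.time.max.ms, and with INPUT blocked the followers' fetch requests can never reach broker 2. That window can be tuned in the broker's server.properties (the value below is only an example; the default is 10000 ms in older versions and 30000 ms in newer ones):

replica.lag.time.max.ms=30000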