I'm trying to move away from SQS to RabbitMQ for messaging service. I'm looking to build a stable high availability queuing service. For now I'm going with cluster.
Current Implementation , I have three EC2 machines with RabbitMQ with management plugin installed in a AMI , and then I explicitly go to each of the machine and add
sudo rabbitmqctl join_cluster rabbit@<hostnameOfParentMachine>
With HA property set to all and the synchronization works. And a load balancer on top it with a DNS assigned. So far this thing works.
Expected Implementation: Create an autoscaling clustered environment where the machines that go Up/Down has to join/Remove the cluster dynamically. What is the best way to achieve this? Please help.
We recently had similar problem.
We tried to use https://github.com/rabbitmq/rabbitmq-autocluster but found it overcomplicated for our use case.
I created terraform configuration to spin X RabbitMQ nodes on Y subnets (availability zones) using Autoscaling Group.
TL;DR https://github.com/ulamlabs/rabbitmq-aws-cluster
The configuration creates IAM role to allow nodes to autodiscover all other nodes in the Autoscaling Group.
I had a similar configuration 2 years ago.
I decided to use amazon VPC, by default my design had two RabbitMQ instances always running, and configured in cluster (called master-nodes). The rabbitmq cluster was behind an internal amazon load balancer.
I created an AMI with RabbitMQ and management plug-in configured (called “master-AMI”), and then I configured the autoscaling rules.
if an autoscaling alarm is raised a new master-AMI is launched. This AMI executes the follow script the first time is executed:
The scripts uses the HTTP API “http://internal-myloadbalamcer-xxx.com:15672/api/nodes” to find nodes, then choose one and binds the new AMI to the cluster.
As HA policy I decided to use this:
Well, the join is “quite” easy, the problem is decide when you can remove the node from the cluster.
You can’t remove a node based on autoscaling rule, because you can have messages to the queues that you have to consume.
I decided to execute a script periodically running to the two master-node instances that:
This is broadly what I did, hope it helps.
[EDIT]
I edited the answer, since there is this plugin that can help:
I suggest to see this: https://github.com/rabbitmq/rabbitmq-autocluster
The plugin has been moved to the official RabbitMQ repository, and can easly solve this kind of the problems