Background
I'm trying to set up clustering between a few Elixir nodes. My understanding is that I can do this by modifying the release's vm.args. I'm using Distillery to build releases and am following the documentation here: https://hexdocs.pm/distillery/config/runtime.html.
My rel/vm.args file is as follows:
-name <%= release_name %>@${HOSTNAME}
-setcookie <%= release.profile.cookie %>
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '[${SYNC_NODES_MANDATORY}]'
I have a build server running Ubuntu 18.04 and two webservers running Ubuntu 18.04. I'm building the release on the build server, copying the archive to the webservers, then unarchiving it and starting it there.
On the servers, the two vm.args files are rendered as:
-name hifyre_platform@10.10.10.100
-setcookie wefijow89236wj289*PFJ#(*98j3fj()#J()#niof2jio
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '["\'my_app@10.10.10.100\'","\'my_app@10.10.10.200\'"]'
and
-name hifyre_platform@10.10.10.200
-setcookie wefijow89236wj289*PFJ#(*98j3fj()#J()#niof2jio
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '["\'my_app@10.10.10.100\'","\'my_app@10.10.10.200\'"]'
The releases are run via systemd with the following configuration:
[Unit]
Description=My App
After=network.target
[Service]
Type=simple
User=ubuntu
Group=ubuntu
WorkingDirectory=/opt/app
ExecStart=/opt/app/bin/my_app foreground
Restart=on-failure
RestartSec=5
Environment=PORT=8080
Environment=LANG=en_US.UTF-8
Environment=REPLACE_OS_VARS=true
Environment=HOSTNAME=10.10.10.100
SyslogIdentifier=my_app
RemainAfterExit=no
[Install]
WantedBy=multi-user.target
Problem
The releases start fine on both servers, but when I open a remote console and run Node.list(), the result is an empty list unless I manually connect the two nodes.
If I manually run Node.connect(:"my_app@10.10.10.200"), I then see the other node when running Node.list() on each node, but this does not happen automatically on startup.
The vm.args file ends up getting passed to Erlang using the -args_file argument. I went to look at the documentation for -args_file, and found that it's actually not very well documented. It turns out that vm.args is like an onion, in that it has lots of layers, and the documentation seems to be all in the source code.
Let's start with where we want to end up. We want sync_nodes_mandatory to be a list of atoms, and we need to write it in Erlang syntax. If we were using short node names, e.g. my_app@myhost, we could get away with not quoting the atoms, but atoms with dots in them need to be quoted using single quotes, as in ['my_app@10.10.10.100','my_app@10.10.10.200'].
We want this to be the output of the function build_args_from_string in erlexec.c. This function has four rules; the ones that matter here are that a backslash escapes the character after it, and that unescaped single and double quotes are consumed as quoting characters rather than passed through.
So since we want to pass the single quotes through to the parser, we have two alternatives. We can escape the single quotes:
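As a sketch, using the node names from the question, the escaped form that build_args_from_string would need to receive looks like this:

```
[\'my_app@10.10.10.100\',\'my_app@10.10.10.200\']
```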
Or we can enclose the single quotes in double quotes:
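One way to write the double-quoted form, again only as an illustration with the question's node names (here the whole list is wrapped in a single pair of double quotes):

```
"['my_app@10.10.10.100','my_app@10.10.10.200']"
```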
(In fact, it doesn't matter how many and where we put the double quotes, as long as every occurrence of a single quote is inside a pair of double quotes. This is just one possible way of doing it.)
BUT if we choose to escape the single quotes with backslashes, we encounter another layer! The function read_args_file is the function that actually reads the vm.args file from disk before passing it to build_args_from_string, and it imposes its own rules first! Among them: a backslash is consumed as an escape character here too, and a # character ignores all characters until the next newline.
So if we were to write [\'my_app@10.10.10.100\',\'my_app@10.10.10.200\'] in vm.args, read_args_file would eat the backslashes, and build_args_from_string would eat the single quotes, leaving us with an invalid term and an error.
So we could either use double backslashes:
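With the backslashes doubled, the vm.args line might look like this (a sketch, assuming the two layers of escaping described above and the node names from the question):

```
-kernel sync_nodes_mandatory [\\'my_app@10.10.10.100\\',\\'my_app@10.10.10.200\\']
```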
Or just stick with double quotes (a different, equally valid, variant this time):
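For example, each atom could be wrapped in its own pair of double quotes. This is an illustrative variant under the same assumptions, not necessarily the only valid one:

```
-kernel sync_nodes_mandatory ["'my_app@10.10.10.100'","'my_app@10.10.10.200'"]
```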
As noted in the documentation for the kernel application, you also need to set sync_nodes_timeout to a time in milliseconds or infinity. Add something like:
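For instance, a 10-second timeout in vm.args might look like the line below. The value 10000 is only an illustrative assumption, so pick whatever suits your deployment:

```
-kernel sync_nodes_timeout 10000
```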
Here's an alternative solution. I found it while investigating this problem.
Create a file ./priv/sync.config that holds the sync settings (including the timeout), and add a line to vm.args that points Erlang at it.
Build a release and start both nodes within 15 seconds (the timeout value from the config file) with a console attached. Execute Node.list() to verify.
Now you might consider generating this config file when building a release.
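As a rough sketch of the two pieces mentioned above, assuming the file uses the standard Erlang config-file format and is loaded through erl's -config flag. The node names and the 15000 ms timeout are taken from this thread; how the relative path resolves inside a release is an assumption you should check:

```erlang
%% ./priv/sync.config (hypothetical contents)
[{kernel,
  [{sync_nodes_mandatory, ['my_app@10.10.10.100', 'my_app@10.10.10.200']},
   {sync_nodes_timeout, 15000}]}].
```

And the corresponding line in vm.args:

```
-config ./priv/sync.config
```

Because the atoms live in a regular Erlang term file here, they only need ordinary single quotes and none of the extra escaping that vm.args requires.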