Background
I'm trying to set up clustering between a few Elixir nodes. My understanding is that I can do this by modifying the release's vm.args. I'm using Distillery to build releases and am following the documentation here: https://hexdocs.pm/distillery/config/runtime.html.
My rel/vm.args file is as follows:
-name <%= release_name %>@${HOSTNAME}
-setcookie <%= release.profile.cookie %>
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '[${SYNC_NODES_MANDATORY}]'
I have a build server and two webservers, all running Ubuntu 18.04. I build the release on the build server, copy the archive to the webservers, then unarchive and start it there.
On the two servers, the vm.args files are rendered as:
-name hifyre_platform@10.10.10.100
-setcookie wefijow89236wj289*PFJ#(*98j3fj()#J()#niof2jio
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '["\'my_app@10.10.10.100\'","\'my_app@10.10.10.200\'"]'
and
-name hifyre_platform@10.10.10.200
-setcookie wefijow89236wj289*PFJ#(*98j3fj()#J()#niof2jio
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '["\'my_app@10.10.10.100\'","\'my_app@10.10.10.200\'"]'
The releases are run via systemd with the following configuration:
[Unit]
Description=My App
After=network.target
[Service]
Type=simple
User=ubuntu
Group=ubuntu
WorkingDirectory=/opt/app
ExecStart=/opt/app/bin/my_app foreground
Restart=on-failure
RestartSec=5
Environment=PORT=8080
Environment=LANG=en_US.UTF-8
Environment=REPLACE_OS_VARS=true
Environment=HOSTNAME=10.10.10.100
SyslogIdentifier=my_app
RemainAfterExit=no
[Install]
WantedBy=multi-user.target
Problem
The releases start fine on both servers, but when I open a remote console and run Node.list(), the result is an empty list unless I manually connect the two nodes. If I manually run Node.connect(:"my_app@10.10.10.200"), I then see the other node when running Node.list() on each node, but this does not happen automatically on startup.
The vm.args file ends up getting passed to Erlang using the -args_file argument. I went to look at the documentation for -args_file, and found that it's actually not very well documented. It turns out that vm.args is like an onion, in that it has lots of layers, and the documentation seems to be all in the source code.
Let's start with where we want to end up. We want sync_nodes_mandatory to be a list of atoms, and we need to write it in Erlang syntax. If we were using short node names, e.g. my_app@myhost, we could get away with not quoting the atoms, but atoms with dots in them need to be quoted using single quotes:
['my_app@10.10.10.100','my_app@10.10.10.200']
We want this to be the output of the function build_args_from_string in erlexec.c. This function has four rules:
- A backslash character escapes any one character
- A double quote escapes all characters (including backslash) until the next double quote
- A single quote escapes all characters (including backslash) until the next single quote
- A space character marks the end of an argument
Since we want to pass the single quotes through to the parser, we have two alternatives. We can escape the single quotes:
[\'my_app@10.10.10.100\',\'my_app@10.10.10.200\']
Or we can enclose the single quotes in double quotes:
["'my_app@10.10.10.100','my_app@10.10.10.200'"]
(In fact, it doesn't matter how many and where we put the double quotes, as long as every occurrence of a single quote is inside a pair of double quotes. This is just one possible way of doing it.)
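To see why both spellings end up as the same argument, here is a minimal sketch of the four rules above in Elixir. It is a rough model of the rules as described here, not a port of erlexec.c, and the module name ArgsModel is made up for illustration.

```elixir
defmodule ArgsModel do
  @moduledoc "Rough model of the four quoting rules described above; not a port of erlexec.c."

  # Splits a string into a list of arguments.
  def split(string), do: string |> String.codepoints() |> scan(:plain, "", [])

  # End of input: emit the pending argument, if any.
  defp scan([], _mode, "", acc), do: Enum.reverse(acc)
  defp scan([], _mode, cur, acc), do: Enum.reverse([cur | acc])

  # Outside quotes: a backslash is dropped and the next character is kept literally.
  defp scan(["\\", ch | rest], :plain, cur, acc), do: scan(rest, :plain, cur <> ch, acc)

  # Outside quotes: a quote character opens a quoted run and is itself dropped.
  defp scan([q | rest], :plain, cur, acc) when q in ["\"", "'"],
    do: scan(rest, {:quoted, q}, cur, acc)

  # Outside quotes: a space ends the current argument.
  defp scan([" " | rest], :plain, "", acc), do: scan(rest, :plain, "", acc)
  defp scan([" " | rest], :plain, cur, acc), do: scan(rest, :plain, "", [cur | acc])

  # Inside quotes: the matching quote closes the run and is dropped.
  defp scan([q | rest], {:quoted, q}, cur, acc), do: scan(rest, :plain, cur, acc)

  # Anything else (including a backslash inside quotes) is kept literally.
  defp scan([ch | rest], mode, cur, acc), do: scan(rest, mode, cur <> ch, acc)
end

# ~S keeps backslashes as written, so these are the exact characters from the text above.
escaped = ~S([\'my_app@10.10.10.100\',\'my_app@10.10.10.200\'])
quoted = ~S(["'my_app@10.10.10.100','my_app@10.10.10.200'"])
bare = "['my_app@10.10.10.100','my_app@10.10.10.200']"

IO.inspect(ArgsModel.split(escaped))
# => ["['my_app@10.10.10.100','my_app@10.10.10.200']"]
IO.inspect(ArgsModel.split(quoted))
# => ["['my_app@10.10.10.100','my_app@10.10.10.200']"]
IO.inspect(ArgsModel.split(bare))
# => ["[my_app@10.10.10.100,my_app@10.10.10.200]"]
```

The last call shows what happens with no protection at all: the single quotes are treated as quoting characters and removed, which leaves a term the Erlang parser rejects.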
BUT if we choose to escape the single quotes with backslashes, we encounter another layer! The function read_args_file is the function that actually reads the vm.args file from disk before passing it to build_args_from_string, and it imposes its own rules first! Namely:
- A backslash character escapes any one character
- A # character ignores all characters until the next newline
- Any whitespace character is replaced by a single space, unless escaped by a backslash
So if we were to write [\'my_app@10.10.10.100\',\'my_app@10.10.10.200\'] in vm.args, read_args_file would eat the backslashes, and build_args_from_string would eat the single quotes, leaving us with an invalid term and an error:
$ iex --erl '-args_file /tmp/vm.args'
2019-04-25 17:00:02.966277 application_controller: ~ts: ~ts~n
["syntax error before: ","'.'"]
"[my_app@10.10.10.100,my_app@10.10.10.200]"
{"could not start kernel pid",application_controller,"{bad_environment_value,\"[my_app@10.10.10.100,my_app@10.10.10.200]\"}"}
could not start kernel pid (application_controller) ({bad_environment_value,"[my_app@10.10.10.100,my_app@10.10.10.200]"})
Crash dump is being written to: erl_crash.dump...done
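The first layer can be modelled the same way. The following is a rough Elixir sketch of the three read_args_file rules listed above, not a port of the C code; the module name ArgsFileModel is made up for illustration, and the sketch replaces each unescaped whitespace character with a space without collapsing runs.

```elixir
defmodule ArgsFileModel do
  @moduledoc "Rough model of the three read_args_file rules described above; not a port of erlexec.c."

  # Returns the single line of text that the next layer would receive.
  def read(contents), do: contents |> String.codepoints() |> scan("")

  defp scan([], out), do: out

  # A backslash is dropped and the next character is kept literally.
  defp scan(["\\", ch | rest], out), do: scan(rest, out <> ch)

  # A '#' discards everything up to (but not including) the next newline.
  defp scan(["#" | rest], out), do: scan(Enum.drop_while(rest, &(&1 != "\n")), out)

  # Each unescaped whitespace character becomes a space.
  defp scan([ch | rest], out) when ch in [" ", "\t", "\n", "\r"], do: scan(rest, out <> " ")

  # Everything else passes through unchanged.
  defp scan([ch | rest], out), do: scan(rest, out <> ch)
end

# Single backslashes are consumed here, so the next layer sees bare single quotes.
IO.puts(ArgsFileModel.read(~S([\'my_app@10.10.10.100\',\'my_app@10.10.10.200\'])))
# prints ['my_app@10.10.10.100','my_app@10.10.10.200']

# With doubled backslashes, one backslash survives for the next layer to consume.
IO.puts(ArgsFileModel.read(~S([\\'my_app@10.10.10.100\\',\\'my_app@10.10.10.200\\'])))
# prints [\'my_app@10.10.10.100\',\'my_app@10.10.10.200\']
```

In the first case the output still contains unprotected single quotes, which the argument-splitting layer then strips, producing the unquoted term seen in the error above.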
So we could either use double backslashes:
-kernel sync_nodes_mandatory [\\'my_app@10.10.10.100\\',\\'my_app@10.10.10.200\\']
Or just stick with double quotes (a different, equally valid, variant this time):
-kernel sync_nodes_mandatory "['my_app@10.10.10.100','my_app@10.10.10.200']"
As noted in the documentation for the kernel application, you also need to set sync_nodes_timeout to a time in milliseconds or infinity:
Specifies the time (in milliseconds) that this node waits for the mandatory and optional nodes to start. If this parameter is undefined, no node synchronization is performed.
Add something like:
-kernel sync_nodes_timeout 10000
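One way to check whether the kernel application actually picked up both settings is to inspect its environment from a remote console after the release has started. This is a minimal sketch using only standard library calls. The comments describe what one would expect for the configuration above and are not captured output.

```elixir
# Run in a remote console on either node.

# Expected to be the list of node-name atoms if vm.args was parsed as intended,
# or nil if the flag never reached the kernel application.
IO.inspect(Application.get_env(:kernel, :sync_nodes_mandatory), label: "sync_nodes_mandatory")

# Expected to be the timeout from vm.args, or nil if it was not set.
IO.inspect(Application.get_env(:kernel, :sync_nodes_timeout), label: "sync_nodes_timeout")

# Lists the nodes this node is currently connected to.
IO.inspect(Node.list(), label: "connected nodes")
```

If the first value is nil or the second is missing, the problem is in how vm.args was parsed rather than in the network between the nodes.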
Here's an alternative solution. I found it while investigating this problem.
Create file ./priv/sync.config with the following content:
[{kernel, [
  {sync_nodes_mandatory, ['my_app@10.10.10.100', 'my_app@10.10.10.200']},
  {sync_nodes_timeout, 15000}
]}].
Add this line to vm.args:
-config <%= :code.priv_dir(release_name) %>/sync
Build a release and start both nodes within 15 seconds (the timeout value from the config file) with a console attached. Execute Node.list() to verify.
Now you might consider generating this config file when building a release.
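As a starting point for generating the file, here is a minimal sketch that renders the same kernel settings from a list of node names. The module name SyncConfig and its functions are hypothetical helpers, not part of Distillery. How and when to call it during a build is left open. It relies on Erlang's ~p formatting to print the term in Erlang syntax, including the single quotes around atoms that need them.

```elixir
defmodule SyncConfig do
  @moduledoc "Hypothetical helper: renders a kernel sync config in Erlang term syntax."

  # Builds the same structure as the hand-written sync.config.
  # Node names arrive as strings and are converted to atoms.
  def build(node_names, timeout_ms) do
    [
      kernel: [
        sync_nodes_mandatory: Enum.map(node_names, &String.to_atom/1),
        sync_nodes_timeout: timeout_ms
      ]
    ]
  end

  # ~p pretty-prints the term as Erlang source; the trailing "." ends the term.
  def render(config), do: :io_lib.format(~c"~p.~n", [config])

  # Writes the rendered config to the given path.
  def write!(path, node_names, timeout_ms) do
    File.write!(path, render(build(node_names, timeout_ms)))
  end
end

nodes = ["my_app@10.10.10.100", "my_app@10.10.10.200"]
IO.puts(SyncConfig.render(SyncConfig.build(nodes, 15_000)))
```

Calling SyncConfig.write!("priv/sync.config", nodes, 15_000) from the project root would produce a file that the -config flag can load, assuming the priv directory already exists.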