I would like to build an Erlang/OTP-based system which solves an 'embarrassingly parallel' problem.
I have already read/skimmed through:
- Learn You Some Erlang;
- Programming Erlang (Armstrong);
- Erlang Programming (Cesarini);
- Erlang/OTP in Action.
I have got the gist of Processes, Messaging, Supervisors, gen_servers, Logging, etc.
I do understand that certain architecture choices depend on the application in question, but I would still like to know some general principles of Erlang/OTP system design.
Should I just start with a few gen_servers with a supervisor and incrementally build on that?
How many supervisors should I have? How do I decide which parts of the system should be process-based? How should I avoid bottlenecks?
Should I add logging later?
What is the general approach to Erlang/OTP distributed fault-tolerant multiprocessors systems architecture?
Should I just start with a few gen_servers with a supervisor and incrementally build on that?
You're missing one key component in Erlang architectures here: applications! (That is, the concept of OTP applications, not software applications).
Think of applications as components. A component in your system solves a particular problem, is responsible for a coherent set of resources, or abstracts something important or complex from the rest of the system.
The first step when designing an Erlang system is to decide which applications are needed. Some can be pulled from the web as they are; these we can refer to as libraries. Others you'll need to write yourself (otherwise you wouldn't need this particular system). These applications we usually refer to as the business logic. Often you need to write some libraries yourself as well, but it is useful to keep the distinction between the libraries and the core business applications that tie everything together.
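As a rough sketch of what one such component looks like in code: an OTP application is a callback module whose start/2 returns the pid of the component's top-level supervisor. The names my_component_app and my_component_sup are hypothetical, and both behaviours are folded into one module only to keep the example self-contained; in a real project the supervisor would normally live in its own module.

```erlang
%% Hypothetical component; all names are illustrative.
-module(my_component_app).
-behaviour(application).
-behaviour(supervisor).

%% application callbacks
-export([start/2, stop/1]).
%% supervisor callback
-export([init/1]).

%% Called when the application is started, e.g. by application:start/1.
%% It must return the pid of the component's top-level supervisor.
start(_StartType, _StartArgs) ->
    supervisor:start_link({local, my_component_sup}, ?MODULE, []).

%% Called after the application has stopped; nothing to clean up here.
stop(_State) ->
    ok.

%% Top-level supervisor with no children yet; workers and
%% sub-supervisors would be listed in the child spec list.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 1, period => 5},
    {ok, {SupFlags, []}}.
```

To start it with application:start/1 you also need an application resource file (my_component.app) that names this module in its mod entry, i.e. {mod, {my_component_app, []}}.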
How many supervisors should I have?
You should have one supervisor for each kind of process you want to monitor.
A bunch of identical temporary workers? One supervisor to rule them all.
Different processes with different responsibilities and restart strategies? A supervisor for each type of process, arranged in a proper hierarchy (depending on when things should restart and which other processes need to go down with them).
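For the "bunch of identical temporary workers" case, a minimal sketch could look like the following. The module name worker_pool_sup, the start_worker/1 helper and the intensity and period values are assumptions made for illustration, not something prescribed by OTP.

```erlang
-module(worker_pool_sup).
-behaviour(supervisor).

-export([start_link/0, start_worker/1]).
-export([init/1, run_job/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start one more worker under the supervisor. Job is a zero-arity fun.
%% With simple_one_for_one, the list passed here is appended to the
%% arguments in the child spec, so the child is started as run_job(Job).
start_worker(Job) when is_function(Job, 0) ->
    supervisor:start_child(?MODULE, [Job]).

init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5,   % illustrative values
                 period => 10},
    Worker = #{id => worker,
               start => {?MODULE, run_job, []},
               restart => temporary,      % never restarted
               shutdown => brutal_kill,
               type => worker},
    {ok, {SupFlags, [Worker]}}.

%% Child start function. It runs in the supervisor process, so
%% proc_lib:spawn_link/1 links the new worker to the supervisor.
run_job(Job) ->
    {ok, proc_lib:spawn_link(Job)}.
```

Each call to worker_pool_sup:start_worker(fun() -> ... end) then runs one job in its own supervised process. Since the workers are temporary, a crashed job is not restarted; it is up to the caller to decide whether to submit it again.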
Sometimes it is okay to put a bunch of different process types under the same supervisor. This is usually the case when you have a few singleton processes (e.g. one HTTP server supervisor, one ETS table owner process, one statistics collector) that will always run. In that case, it might be too much cruft to have one supervisor for each, so it is common to add them under one supervisor. Just be aware of the implications of using a particular restart strategy when doing this, so that you don't, for example, take down your statistics process when your web server crashes (one_for_one is the most common strategy to use in cases like this). Be careful not to have any dependencies between processes in a one_for_one supervisor. If a process depends on another process that has crashed, it can crash as well, hitting the supervisor's restart intensity limit too often and crashing the supervisor itself too soon. This can be avoided by using two different supervisors, each of which fully controls restarts through its own configured intensity and period.
How do I decide which parts of the system should be process-based?
Every concurrent activity in your system should be in its own process. Having the wrong abstraction of concurrency is the most common mistake Erlang system designers make in the beginning.
Some people are not used to dealing with concurrency; their systems tend to have too little of it. One process, or a few gigantic ones, run everything in sequence. These systems are usually full of code smells, and the code is very rigid and hard to refactor. It also makes them slower, because they may not use all the cores available to Erlang.
Other people immediately grasp the concurrency concepts but fail to apply them optimally; their systems tend to overuse the process concept, leaving many processes idle while they wait for others that are doing work. These systems tend to be unnecessarily complex and hard to debug.
In essence, both variants have the same problem: you don't use all the concurrency available to you, and you don't get the maximum performance out of the system.
If you stick to the single responsibility principle and abide by the rule to have a process for every truly concurrent activity in your system, you should be okay.
There are valid reasons to have idle processes. Sometimes they keep important state, sometimes you want to keep some data temporarily and later discard the process, sometimes they wait on external events. The bigger pitfall is to pass important messages through a long chain of largely inactive processes, as it will slow down your system with lots of copying and use more memory.
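For an embarrassingly parallel problem, "one process per truly concurrent activity" often boils down to one short-lived process per independent work item. A minimal sketch, assuming the work items fit in a list and each can be processed by a plain fun (pmap_sketch is a hypothetical module, not a standard library one):

```erlang
-module(pmap_sketch).
-export([pmap/2]).

%% Apply F to every element of List, one process per element,
%% and return the results in the original order.
pmap(F, List) when is_function(F, 1), is_list(List) ->
    Refs = [start_worker(F, X) || X <- List],
    [collect(Ref) || Ref <- Refs].

%% Each worker hands its result back through its exit reason, so the
%% monitor message carries either the result or the crash reason.
start_worker(F, X) ->
    {_Pid, Ref} = spawn_monitor(fun() -> exit({result, F(X)}) end),
    Ref.

collect(Ref) ->
    receive
        {'DOWN', Ref, process, _Pid, {result, Value}} ->
            Value;
        {'DOWN', Ref, process, _Pid, Reason} ->
            error({worker_failed, Reason})
    end.
```

For example, pmap_sketch:pmap(fun(X) -> X * X end, [1, 2, 3]) returns [1, 4, 9]. The caller only sees an ordinary function call; the processes are an implementation detail. This sketch puts no limit on the number of processes, so for very large inputs you would probably want to process the list in batches.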
How should I avoid bottlenecks?
Hard to say, depends very much on your system and what it's doing. Generally though, if you have a good division of responsibility between applications you should be able to scale the application that appears to be the bottleneck separately from the rest of the system.
The golden rule here is to measure, measure, measure! Don't think you have something to improve until you've measured.
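A very small starting point for measuring is timer:tc/1, which runs a fun and returns the elapsed wall-clock time in microseconds together with the result. The helper below is a hypothetical sketch, not a replacement for proper profiling with tools such as fprof, eprof or observer:

```erlang
-module(measure_sketch).
-export([avg_micros/2]).

%% Run Fun N times and return the average wall-clock time per call,
%% in microseconds.
avg_micros(Fun, N) when is_function(Fun, 0), is_integer(N), N > 0 ->
    Times = [element(1, timer:tc(Fun)) || _ <- lists:seq(1, N)],
    lists:sum(Times) / N.
```

Calling measure_sketch:avg_micros(fun() -> lists:sort(lists:seq(100000, 1, -1)) end, 10) before and after a change gives a rough number to compare, rather than a guess.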
Erlang is great in that it allows you to hide concurrency behind interfaces (known as implicit concurrency). For example, you use a functional module API, a normal module:function(Arguments) interface, that could in turn spawn thousands of processes without the caller having to know that. If you get your abstractions and your API right, you can always parallelize or optimize a library after you've started using it.
That being said, here are some general guidelines:
And one piece of bonus advice: don't reuse processes. Spawning a process in Erlang is so cheap and quick that it doesn't make sense to reuse a process once its lifetime is over. In some cases it might make sense to reuse state (e.g. the result of a complex file parse), but that is better stored canonically somewhere else (in an ETS table, a database, etc.).
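As a sketch of keeping reusable state outside the worker processes, the expensive result can be cached in a named ETS table that fresh, short-lived processes read from. The module, the table name parse_cache and the ParseFun argument are all hypothetical:

```erlang
-module(parse_cache_sketch).
-export([init/0, parsed/2]).

%% Create the cache table. The calling process becomes the owner, and
%% the table disappears when that process exits, so call this from a
%% long-lived process (e.g. a dedicated table owner).
init() ->
    ets:new(parse_cache, [named_table, public, set,
                          {read_concurrency, true}]).

%% Return the parsed form of Path, running ParseFun at most once per
%% key in the common case. Two processes racing on the same key may
%% both parse it; the result is the same, so that is harmless here.
parsed(Path, ParseFun) when is_function(ParseFun, 1) ->
    case ets:lookup(parse_cache, Path) of
        [{Path, Parsed}] ->
            Parsed;
        [] ->
            Parsed = ParseFun(Path),
            true = ets:insert(parse_cache, {Path, Parsed}),
            Parsed
    end.
```

Any newly spawned worker can then call parse_cache_sketch:parsed(Path, ParseFun) and get the cached value without a long-lived process having to be kept around just to hold it.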
Should I add logging later?
You should add logging now! There's a great built-in API called Logger that comes with Erlang/OTP from version 21.
This new API has several advanced features and should cover most cases where you need logging. There's also the older but still widely used 3rd party library Lager.
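A minimal sketch of what logging with Logger can look like, using the macros from kernel's logger.hrl. The module logging_sketch and the functions handle_job/1 and do_work/1 are hypothetical placeholders for your own code:

```erlang
-module(logging_sketch).
-export([handle_job/1]).

%% Provides the ?LOG_DEBUG, ?LOG_INFO, ?LOG_ERROR, ... macros, which
%% add module, function and line metadata to every log event.
-include_lib("kernel/include/logger.hrl").

handle_job(JobId) ->
    ?LOG_INFO("starting job ~p", [JobId]),
    try
        Result = do_work(JobId),
        %% Structured (map) reports are easier to filter than strings.
        ?LOG_DEBUG(#{event => job_done, job => JobId, result => Result}),
        {ok, Result}
    catch
        Class:Reason:Stacktrace ->
            ?LOG_ERROR(#{event => job_failed, job => JobId,
                         class => Class, reason => Reason,
                         stacktrace => Stacktrace}),
            {error, Reason}
    end.

%% Placeholder for the real work; fails for anything but an integer.
do_work(JobId) when is_integer(JobId) ->
    JobId * 2.
```

Note that the default primary log level is notice, so the info and debug events above only show up after something like logger:set_primary_config(level, debug).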
What is the general approach to Erlang/OTP distributed fault-tolerant multiprocessors systems architecture?
To summarize what's been said above:
Common pitfalls: