Let's say we have an 'embarrassingly parallel' problem to solve with our Erlang software. We have many parallel processes, each executing sequential code (not number crunching), and the more CPUs we throw at them, the better.
I have heard about CUDA bindings for Erlang, but after watching Kevin Smith's presentation I am not sure that they are the solution: the whole purpose of a pteracuda buffer is to hand a heavy number-crunching task to the GPU and get the result back. It is not possible to use the GPU's processors to serve Erlang's processes. (Am I right?)
On the other hand, multicore CPUs are really expensive (8-core CPU prices start at $300). So, to build a 10-machine Erlang parallel-processing 'cluster', you have to spend at least $3000 on CPUs alone.
So, the question is: what kind of affordable CPU or GPU can be used to build a 'server cluster' for parallel Erlang software?
I would look into Amazon EC2. You can rent servers very cheaply and spin them up almost instantly when there is work to be done. You can also bid on very cheap spot instances. This would at least give you a great way to test your code on multiple boxes, and allow for some benchmarking before you decide whether to buy the hardware later. They also offer GPU instances (at a higher rate), which have Tesla GPUs and quad-core hyper-threaded processors. Here is a list of all the instance types available.
Here is a simple guide I found to help you get started setting up a master node that can spin up additional nodes if needed.
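As a rough sketch of the Erlang side (the node names and do_work/1 are hypothetical, and the module must be loaded on every worker node), a master that fans jobs out to connected worker nodes could look like this:

    -module(master).
    -export([run/2]).

    %% Hypothetical sketch: fan jobs out round-robin to a list of worker
    %% nodes and collect the results. This module must also be loaded on
    %% every worker node for the remote funs to run there.
    run(Nodes, Jobs) ->
        [pong = net_adm:ping(N) || N <- Nodes],   %% connect to the workers
        Self = self(),
        Pids = [spawn(lists:nth((I - 1) rem length(Nodes) + 1, Nodes),
                      fun() -> Self ! {self(), do_work(Job)} end)
                || {I, Job} <- lists:zip(lists:seq(1, length(Jobs)), Jobs)],
        %% Gather results, preserving the original job order.
        [receive {Pid, Result} -> Result end || Pid <- Pids].

    do_work(Job) ->
        Job * 2.   %% stand-in for the real sequential computation

Since spawn/2 takes a node argument, the master only coordinates; each worker node does the actual sequential computation, which matches the 'more CPUs, the better' workload you describe.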
There was a student project at Uppsala University in 2009 called LOPEC that had this aim, in cooperation with Erlang Solutions (then still called Erlang Training & Consulting, or ETC for short).
I couldn't find any slides from their final project report, but this is a poster they showed at the Erlang User Conference in 2009: http://www.it.uu.se/edu/course/homepage/projektDV/ht09/ETC_poster.pdf
Parts of the code seem to live on here: https://github.com/burbas/LoPECv2 (the user burbas was one of the students), but it is strangely incomplete. You could ask burbas for more info.
There is of course also the Disco project by Nokia: http://discoproject.org/
In both cases, I think you'll need to write a C or Python stub to run on the clients to talk to the GPU (or you might run Erlang with CUDA bindings on the clients); the frameworks above just help you distribute the workload and gather results.
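For that stub, Erlang's port mechanism is the standard way to talk to an external program. A minimal sketch, assuming a hypothetical gpu_stub executable that reads one line of input and writes one line of output per request:

    -module(gpu_port).
    -export([compute/1]).

    %% Minimal sketch of driving an external GPU program through a port.
    %% The gpu_stub executable and its line-based protocol are assumptions;
    %% adapt the framing to whatever your C stub actually speaks.
    compute(Input) ->
        Port = open_port({spawn, "./gpu_stub"}, [{line, 1024}]),
        port_command(Port, [Input, $\n]),          %% send one request line
        receive
            {Port, {data, {eol, Result}}} ->       %% one result line back
                port_close(Port),
                Result
        after 5000 ->
            port_close(Port),
            {error, timeout}
        end.

In a real deployment you would keep the port open across requests rather than spawning a fresh OS process per job, but the message flow stays the same.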