I've started tinkering with the Node.js HTTP server and I really like writing server-side JavaScript, but something is keeping me from using Node.js for my web application.
I understand the whole async I/O concept, but I'm somewhat concerned about the edge cases where procedural code is very CPU-intensive, such as image manipulation or sorting large data sets.
As I understand it, the server will be very fast for simple web page requests, such as viewing a listing of users or viewing a blog post. However, if I want to write very CPU-intensive code (in the admin back end, for example) that generates graphics or resizes thousands of images, the request will be very slow (a few seconds). Since this code is not async, every request coming to the server during those few seconds will be blocked until my slow request is done.
One suggestion was to use Web Workers for CPU-intensive tasks. However, I'm afraid Web Workers will make it hard to write clean code, since they work by including a separate JS file. What if the CPU-intensive code is located in an object's method? It kind of sucks to write a JS file for every method that is CPU-intensive.
Another suggestion was to spawn a child process, but that makes the code even less maintainable.
Any suggestions for overcoming this (perceived) obstacle? How do you write clean, object-oriented code with Node.js while making sure CPU-heavy tasks are executed asynchronously?
What you need is a task queue! Moving your long-running tasks out of the web server is a GOOD thing. Keeping each task in a separate JS file promotes modularity and code reuse. It forces you to think about how to structure your program in a way that will make it easier to debug and maintain in the long run. Another benefit of a task queue is that the workers can be written in a different language: just pop a task, do the work, and write the response back.
Something like Resque: https://github.com/resque/resque
Here is an article from GitHub about why they built it: http://github.com/blog/542-introducing-resque
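A worker in that style is little more than a blocking pop loop. Here's a rough sketch assuming jobs are JSON payloads pushed onto a Redis list, using the ioredis client (the queue name and job shape are made up for illustration):

```js
const Redis = require('ioredis');
const redis = new Redis(); // assumes Redis on localhost:6379

async function handle(job) {
  // ...the CPU-heavy part: resize an image, crunch a data set, etc....
  return { id: job.id, ok: true };
}

async function work() {
  for (;;) {
    // BRPOP blocks until a job arrives; it returns [listName, payload].
    const [, payload] = await redis.brpop('jobs', 0);
    const job = JSON.parse(payload);
    const result = await handle(job);
    // Write the response back where the web tier can pick it up.
    await redis.lpush(`results:${job.id}`, JSON.stringify(result));
  }
}

work().catch(console.error);
```

Your web server just pushes onto the `jobs` list and returns immediately; the worker (in Node, Ruby, or anything else that speaks Redis) grinds through the queue at its own pace.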
This is a misunderstanding of what a web server is -- it should only be used to talk to clients. Heavy tasks should be delegated to standalone programs (which, of course, can also be written in JS).
You'd probably say that this is dirty, but I assure you that a web server process stuck resizing images is worse (even for, let's say, Apache, where it does not block other requests). Still, you can use a common library to avoid code redundancy.
EDIT: I have come up with an analogy; a web application should be like a restaurant. You have waiters (the web server) and cooks (the workers). Waiters are in contact with clients and do simple tasks like handing out menus or explaining whether a dish is vegetarian. On the other hand, they delegate the harder tasks to the kitchen. Because the waiters do only simple things, they respond quickly, and the cooks can concentrate on their job.
Node.js here would be a single but very talented waiter that can process many requests at a time, while Apache would be a gang of dumb waiters that each process just one request. If this one Node.js waiter began to cook, it would be an immediate catastrophe. Still, cooking could also exhaust even a large supply of Apache waiters, not to mention the chaos in the kitchen and the progressive decrease in responsiveness.
You don't want your CPU-intensive code to execute async; you want it to execute in parallel. You need to get the processing work out of the thread that's serving HTTP requests. It's the only way to solve this problem. With Node.js the answer is the cluster module, which spawns child processes to do the heavy lifting. (AFAIK Node doesn't have any concept of threads/shared memory; it's processes or nothing.)

You have two options for how you structure your application. You can get the 80/20 solution by spawning 8 HTTP servers and handling compute-intensive tasks synchronously in the child processes. Doing that is fairly simple. You could take an hour to read the cluster module documentation; in fact, if you just lift the example code at the top of it, you will get yourself 95% of the way there.
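For reference, a minimal sketch in the spirit of that doc example (the port and handler are placeholders):

```js
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) { // cluster.isPrimary in newer Node versions
  // Fork one worker per CPU core; each worker gets its own event loop.
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  // Workers share the listening port; a worker stuck in synchronous
  // CPU work only stalls its own requests, not the whole server.
  http.createServer((req, res) => {
    res.end(`handled by worker ${process.pid}\n`);
  }).listen(8000);
}
```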
The other way to structure this is to set up a job queue and send big compute tasks over the queue. Note that there is a lot of overhead associated with the IPC for a job queue, so this is only useful when the tasks are appreciably larger than the overhead.
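A rough sketch of that pattern using child_process.fork and its built-in message channel (the file names and job shape are invented for illustration):

```js
// server.js - fork a dedicated compute process and talk to it over IPC
const { fork } = require('child_process');

const worker = fork('./compute-worker.js');
worker.on('message', (result) => {
  console.log('job finished:', result);
});
worker.send({ id: 1, task: 'resize', imageId: 123 });
```

```js
// compute-worker.js - receive jobs, do the heavy work, reply
process.on('message', (job) => {
  // ...CPU-heavy work happens here, off the server's event loop...
  process.send({ id: job.id, ok: true });
});
```

Every message is serialized and copied between processes, which is where the IPC overhead comes from: it only pays off when the job itself dwarfs the cost of shipping it.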
I'm surprised that none of these other answers even mention cluster.
Background:
Asynchronous code is code that suspends until something happens somewhere else, at which point the code wakes up and continues execution. One very common case where something slow must happen somewhere else is I/O.
Asynchronous code isn't useful if it's your processor that is responsible for doing the work. That is precisely the case with "compute intensive" tasks.
Now, it might seem that asynchronous code is niche, but in fact it's very common. It just happens not to be useful for compute intensive tasks.
Waiting on I/O is a pattern that comes up constantly in web servers, for example. Every client who connects to your server gets a socket. Most of the time the sockets are empty. You don't want to do anything until a socket receives some data, at which point you want to handle the request. Under the hood, an HTTP server like Node uses an eventing library (libev) to keep track of the thousands of open sockets. The OS notifies libev, libev notifies Node.js when one of the sockets gets data, Node.js puts an event on the event queue, and your HTTP code kicks in at that point and handles the events one after the other. Events don't get put on the queue until the socket has some data, so events are never waiting on data -- it's already there for them.
A single-threaded, event-based web server makes sense as a paradigm when the bottleneck is waiting on a bunch of mostly empty socket connections: you don't want a whole thread or process for every idle connection, and you don't want to poll your 250k sockets to find the next one that has data on it.
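To make the "compute intensive" point concrete, here's a toy server where one synchronous handler stalls every other request (port, route, and loop size are arbitrary):

```js
const http = require('http');

http.createServer((req, res) => {
  if (req.url === '/compute') {
    // Synchronous CPU work: async style can't help here, because the
    // event loop itself is busy until this loop finishes.
    let sum = 0;
    for (let i = 0; i < 2e9; i++) sum += i;
    res.end(`sum: ${sum}\n`);
  } else {
    // While /compute runs, even this trivial handler can't fire.
    res.end('fast\n');
  }
}).listen(8000);
```

Hit `/compute` in one tab and any other URL in a second tab: the second request waits until the loop finishes, which is exactly the behavior the question describes.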
There are a couple of approaches you can use.
As @Tim notes, you can create an asynchronous task that sits outside of, or parallel to, your main serving logic. Depending on your exact requirements, even cron can act as a queueing mechanism.
Web Workers can work for your async processes, but they are currently not supported by Node.js. There are a couple of extensions that provide support, for example: http://github.com/cramforce/node-worker
You can still reuse modules and code through the standard require mechanism. You just need to ensure that the initial dispatch to the worker passes all the information needed to process the results.
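For what it's worth, newer Node.js versions ship a built-in worker_threads module that fills this role. A minimal sketch of the dispatch-everything-up-front pattern (the Fibonacci task is just a stand-in for real CPU work):

```js
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // Dispatch: hand the worker everything it needs via workerData.
  const worker = new Worker(__filename, { workerData: { n: 40 } });
  worker.on('message', (result) => console.log('result:', result));
} else {
  // Worker thread: do the CPU-heavy work and post the result back.
  const fib = (n) => (n < 2 ? n : fib(n - 1) + fib(n - 2));
  parentPort.postMessage(fib(workerData.n));
}
```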
Using child_process is one solution, but each spawned child process may consume a lot of memory compared to Go goroutines.
You can also use a queue-based solution such as kue.
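A minimal sketch of the kue pattern, assuming a local Redis and a made-up "resize" job type:

```js
const kue = require('kue');
const queue = kue.createQueue(); // connects to a local Redis by default

// In the web server: enqueue the heavy job and respond immediately.
queue.create('resize', { imageId: 123, width: 800 }).save((err) => {
  if (err) console.error('could not enqueue job', err);
});

// In a separate worker process: pull jobs off the queue and do the work.
queue.process('resize', (job, done) => {
  // ...resize job.data.imageId to job.data.width (the CPU-heavy part)...
  done();
});
```

Because the producer and consumer only share Redis, you can run as many worker processes as you have cores, and the web server never blocks on the heavy work.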