socket.io vs RethinkDB changefeed

Posted 2019-03-08 08:51

Question:

Currently I'm using socket.io without RethinkDB like this:

Clients emit events to socket.io, which receives the events, emits to various other clients, and saves to the db for persistence. A new client connecting will get existing data from the db then listen to new events over socket.io.
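
For reference, a rough sketch of that flow in Node.js might look like the following (the `db` helper with its `getMessages`/`saveMessage` methods and the `'chat message'` event name are hypothetical placeholders, not something from the question):

```js
const http = require('http').createServer();
const io = require('socket.io')(http);
const db = require('./db'); // hypothetical persistence layer

io.on('connection', async (socket) => {
  // New client: send existing data from the db, then stream new events.
  socket.emit('history', await db.getMessages());

  socket.on('chat message', async (msg) => {
    socket.broadcast.emit('chat message', msg); // fan out to other clients
    await db.saveMessage(msg);                  // persist afterwards
  });
});

http.listen(3000);
```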

How would switching to RethinkDB and the changefeed help me here?

The way I see the same thing working with RethinkDB is that the client could do a POST (which inserts into RethinkDB) instead of emitting to socket.io, and socket.io would then watch a RethinkDB changefeed and emit to all clients when it receives new data.
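
Roughly, the write path I have in mind would be something like this (Express, the `/messages` route, and the `messages` table are just assumptions for the sake of the sketch):

```js
const express = require('express');
const r = require('rethinkdb');

const app = express();
app.use(express.json());

// Client POSTs a message; the server inserts it into RethinkDB.
// (Opening a connection per request is only for brevity here.)
app.post('/messages', async (req, res) => {
  const conn = await r.connect({ host: 'localhost', port: 28015 });
  const result = await r.table('messages').insert(req.body).run(conn);
  await conn.close();
  res.status(201).json(result);
});

app.listen(3000);
```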

How is this method using RethinkDB and the changefeed better than my current method? To me they both feel like they accomplish the same thing, but I don't see any obvious advantage in the RethinkDB method, and because I'd be going through the db rather than emitting straight from socket.io on the server, it will surely be a bit slower.

Answer 1:

First, let's clarify the relationship between socket.io and RethinkDB changefeeds. Socket.io is intended for realtime communication between the client (the browser) and the server (Node.js). RethinkDB changefeeds are a way for your server (Node.js) to listen for changes in the database. The client can't communicate with RethinkDB directly.

A very typical architecture for a realtime app is to have RethinkDB changefeeds subscribe to changes in the database and then use socket.io to pass those changes to the client. The client usually also emits messages which can get written to your database, depending on your application logic.
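
A minimal sketch of that bridge might look like this (the `messages` table and event names are assumptions, and error handling is omitted):

```js
const http = require('http').createServer();
const io = require('socket.io')(http);
const r = require('rethinkdb');

async function main() {
  const conn = await r.connect({ host: 'localhost', port: 28015 });

  // Changefeed: any change to the table is pushed to every connected client.
  const cursor = await r.table('messages').changes().run(conn);
  cursor.each((err, change) => {
    if (err) throw err;
    io.emit('message changed', change.new_val);
  });

  // Writes from clients go to the database, not directly to other sockets;
  // the changefeed above is what fans them back out.
  io.on('connection', (socket) => {
    socket.on('new message', (msg) => {
      r.table('messages').insert(msg).run(conn);
    });
  });

  http.listen(3000);
}

main();
```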

Yes, you could just emit all messages through socket.io, pass them on to all clients, and then write them to the database for persistence. It's also true that this is faster, but there are a number of disadvantages to that approach.

1. Database as single source of truth

The easiest problem to spot is the following:

  • What happens if your app isn't able to write something to the database?
  • What happens if the data you're trying to insert into the database is invalid or a duplicate? Do you write application logic to handle this?
  • What happens if the Node.js server goes down before sending out the write query?

These are just a few quick examples of cases where, because of this architecture, you will lose data or end up with out-of-sync data. And just to reiterate: you WILL lose data, because your main source of truth is in memory. You might also have discrepancies between the data in your Node.js app and the data in your database.

The point is that the database should always be your single source of truth and you should only acknowledge data when it's written to disk. I'm not sure how anyone would be able to sleep at night otherwise.
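
As a sketch, with socket.io's acknowledgement callbacks you can make the client treat a message as accepted only once RethinkDB has confirmed the write (the table and event names are again assumptions):

```js
const http = require('http').createServer();
const io = require('socket.io')(http);
const r = require('rethinkdb');

r.connect({ host: 'localhost', port: 28015 }).then((conn) => {
  io.on('connection', (socket) => {
    // `ack` is socket.io's acknowledgement callback: the client only treats
    // the message as accepted once the database write has actually succeeded.
    socket.on('new message', async (msg, ack) => {
      try {
        const result = await r.table('messages').insert(msg).run(conn);
        if (result.errors > 0) {
          return ack({ ok: false, error: result.first_error });
        }
        ack({ ok: true });
      } catch (err) {
        ack({ ok: false, error: err.message });
      }
    });
  });

  http.listen(3000);
});
```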

2. Advanced Queries

If you just pass every new message from every client to every other client through socket.io, you now need some pretty complex logic on the client in order to pick out the data that actually matters to it. You're also pushing a lot of data over the network that the client will never use.

The alternative is writing a pub/sub system in which clients subscribe to certain channels (or something along those lines) so that each client only receives the data that's actually relevant to it.

RethinkDB solves this by providing its own query language that you can attach to changefeeds. If the client, for example, needs all the users in your users table between the ages of 20 and 30, who live in the state of California within 10 miles of San Francisco, and who have bought a book within the last 6 months, this can be expressed in ReQL (RethinkDB's query language) and a changefeed can be set up for that query, so that the client only gets notified of relevant changes. This is much harder to do with just Socket.io and Node.js.
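
As a rough illustration, a simplified version of that kind of filtered changefeed (just the age and state conditions; the geo and purchase-history filters would be additional ReQL, and the field names are assumptions about the schema):

```js
const r = require('rethinkdb');

// `io` is a socket.io server instance created elsewhere.
async function watchYoungCalifornians(io) {
  const conn = await r.connect({ host: 'localhost', port: 28015 });

  // Only rows matching the ReQL filter produce change notifications.
  const cursor = await r.table('users')
    .filter(
      r.row('age').ge(20)
        .and(r.row('age').le(30))
        .and(r.row('state').eq('California'))
    )
    .changes()
    .run(conn);

  cursor.each((err, change) => {
    if (err) throw err;
    io.emit('user changed', change.new_val); // only the relevant changes arrive here
  });
}
```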

3. Scalability

The last problem that RethinkDB solves is that it's a much more scalable solution than just storing everything in memory (through Socket.io and Node.js). Because RethinkDB is built from the ground up to be distributed, you can have a cluster of 20+ RethinkDB nodes with shards and replicas. Every RethinkDB query you write is distributed by default. On top of that, you can have 20+ Node.js nodes that are stateless and are all listening to changefeeds. Because the database is the central source of truth, this is not a problem.
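
As a sketch, each of those stateless Node.js nodes could run the exact same listener, with only its connection details coming from the environment (the `RETHINKDB_HOST` variable and `messages` table are assumptions):

```js
const r = require('rethinkdb');

// Every stateless worker runs this same listener; nothing lives in process
// memory, so adding or removing Node.js nodes doesn't lose any data.
async function startWorker(io) {
  const conn = await r.connect({
    host: process.env.RETHINKDB_HOST || 'localhost', // e.g. a cluster node or proxy
    port: 28015,
  });

  const cursor = await r.table('messages').changes().run(conn);
  cursor.each((err, change) => {
    if (err) throw err;
    io.emit('message changed', change.new_val);
  });
}
```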

The alternative would be to limit yourself to one server, build some other pub/sub system (on something like Redis, for example), or have only a single database that you poll... There are probably more examples, but you can see where I'm going with this.


I'd love to hear whether this answered your question and whether I'm getting where you're coming from. It's a little hard to figure out how to structure these applications at first, but this really is an elegant solution for most realtime architectures.