I have a Chrome container (deployed using this Dockerfile) that renders pages on request from an App container.
The basic flow is:
- App sends an http request to Chrome and in response receives a websocket url to use (e.g.
ws://chrome.example.com:9222/devtools/browser/13400ef6-648b-4618-8e4c-b5c73db2a122
) - App then uses that websocket url to communicate further with Chrome, and to receive the rendered page. I am using the puppeteer library to connect to and communicate with the Chrome instance, using
puppeteer.connect({ browserWSEndpoint: webSocketUrl })
;
For a single Chrome container this works really well.
But I'm trying to scale things up to have multiple Chrome containers in a Docker swarm.
The problem is, I think, that the websocket url received by App is specific to the instance running in that particular Chrome container, so when it is used by App (and where there are now multiple Chrome containers), the websocket requests from App will not necessarily be routed to the right Chrome container.
What is the best way of dealing with this?
You’ve got the basic design correct, but the issue you’re experiencing is with session “stickiness”. However, instead of trying to re-route subsequent requests back to the appropriate machine, we should look for a way to avoid the "pre" request.
The best way to do that is to have your Chrome docker image man-in-the-middle all http “upgrade” requests. This http action is what all WebSocket connections emit prior to changing protocols including the puppeteer library (which is just a WebSocket client under-the-hood). Doing this will also obviate the need for a pre-connect call since the proxying to Chrome will happen on upgrade vs exposing a URL for the app to use. Here's a pretty basic example of doing this with the http-proxy module:
There's other benefits with this approach as will: you can limit things like concurrency and even inject scripts to be ran at a later time. Those require a little more though and preparation, but the overall idea remains the same. This also makes load-balancing trivial since there's not need to make routing sticky.
If this is something you're interested in implementing all that works is largely done for you in the browserless repo. It even allows for things like concurrency limitations, session time limitations, and includes a feature-rich IDE. You can find more docs on that project here.