Scaling

How to scale Resgate to millions of concurrent users

Resgate and the RES protocol have been developed with scaling in mind. There are many dimensions in which a system might grow, and different types of growth might require different solutions. This document looks at different ways in which infrastructures using Resgate may be built to scale.

Client connections

There are limits on how many concurrent connections a single server can handle. As the user base grows, the capacity to accommodate the increasing number of connections must grow with it.

The simplest single-instance setup of Resgate may handle 10000+ concurrent clients, depending on hardware. The illustration below shows such a setup:

Simple RES Network Diagram
A setup with a single Resgate instance may serve 10000+ concurrent clients.

To increase the capacity, clients may be load balanced over multiple Resgate instances connected to the same NATS server. A setup with double the client capacity could look like this:

Load Balanced RES Network Diagram
Running two load balanced Resgates will double connection capacity.

The RES protocol allows multiple instances of Resgate to be connected to the same NATS server. New instances may be added to an already running system, and existing instances may be taken down or restarted without the clients experiencing any disruption.

The NATS server has no hardcoded limit on how many Resgate instances may be connected, and should be able to handle 10000+ Resgate connections, depending on hardware. With 10000 Resgates serving 10000 clients each, a setup like this could in theory scale to 100M concurrent client connections, at least in the best of all possible worlds.
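The theoretical ceiling is just the number of Resgate instances multiplied by the clients each one can serve. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the per-instance figure of 10000 is the hardware-dependent estimate used in this document, not a measured limit.

```python
def max_client_connections(resgate_instances: int, clients_per_instance: int) -> int:
    """Theoretical ceiling, assuming every instance runs at full capacity."""
    if resgate_instances < 0 or clients_per_instance < 0:
        raise ValueError("counts must be non-negative")
    return resgate_instances * clients_per_instance


if __name__ == "__main__":
    # Single instance, two load balanced instances, and the theoretical maximum.
    print(max_client_connections(1, 10_000))       # 10000
    print(max_client_connections(2, 10_000))       # 20000
    print(max_client_connections(10_000, 10_000))  # 100000000
```

Real deployments will land below this ceiling, since load is rarely spread perfectly and the NATS server itself becomes a bottleneck well before the last instance is added.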

To increase beyond the capacity of a single NATS server, and to add redundancy, we must look into horizontal scaling.

Note

Most types of load balancing approaches work, such as DNS load balancing, Layer 3 / 4 / 7 load balancers, or some kind of orchestrator service.
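As an illustration only, the sketch below shows one simple balancing policy, least connections, applied to a pool of Resgate instances that can grow while the system is running. The instance names and the `pick_instance` helper are hypothetical; a real deployment would normally rely on an existing load balancer rather than custom code.

```python
from typing import Dict


def pick_instance(connections: Dict[str, int]) -> str:
    """Return the instance with the fewest open connections.

    Ties are broken by instance name so the choice is deterministic.
    """
    if not connections:
        raise ValueError("no Resgate instances available")
    return min(sorted(connections), key=connections.__getitem__)


if __name__ == "__main__":
    # Hypothetical pool: instance name -> current number of client connections.
    pool = {"resgate-1": 0, "resgate-2": 0}
    for _ in range(4):
        pool[pick_instance(pool)] += 1
    print(pool)  # {'resgate-1': 2, 'resgate-2': 2}

    # A new instance added to the running system receives the next client.
    pool["resgate-3"] = 0
    print(pick_instance(pool))  # resgate-3
```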

Horizontal scaling

Ordinary REST APIs may be scaled horizontally by replicating a service instance. This is also possible with Resgate, but with some limitations. To quote from the specification [1]:

Resource events are sent for a given resource, and MUST be sent by the same service that handles the resource’s get and call requests.

This means that the RES protocol only permits connecting multiple instances of a service to a single NATS server if the resources served are static. In all other cases, multiple service instances responding to get requests and sending resource events on the same resource would introduce race conditions, which would ultimately lead to inconsistent state in the clients.

There are two ways we may get around this limitation in order to scale horizontally:

Sharding

If a single service is under heavy load, but cannot be broken down into smaller micro-services, sharding the service might be an option instead.

Sharding means running multiple instances (shards) of the same service, where each instance is responsible for a subset of the resources. This does not violate the protocol, as long as the shard that responds to a get request for a specific resource is also the one that subsequently sends any events for that resource.
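One possible way to assign each resource to exactly one shard is a stable hash of the resource ID, sketched below. This is an assumption for illustration and not something prescribed by the RES protocol: the helper names and the resource IDs are hypothetical, and how requests are routed to the owning shard (for example through subject naming) is left out.

```python
import zlib


def shard_for(resource_id: str, shard_count: int) -> int:
    """Map a resource ID to a shard index in the range [0, shard_count).

    CRC32 is used because it is stable across processes and restarts,
    unlike Python's built-in hash() for strings.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return zlib.crc32(resource_id.encode("utf-8")) % shard_count


def owns(shard_index: int, shard_count: int, resource_id: str) -> bool:
    """True if this shard should answer get requests and send events for the resource."""
    return shard_for(resource_id, shard_count) == shard_index


if __name__ == "__main__":
    shard_count = 4
    for rid in ("example.model.1", "example.model.2", "example.model.3"):
        owners = [s for s in range(shard_count) if owns(s, shard_count, rid)]
        # Every resource has exactly one owning shard.
        print(rid, "->", owners)
```

Note that with a plain modulo scheme like this one, changing the number of shards moves most resources to a different owner, so resizing needs to be coordinated across all shards.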

Multiple clusters

By running multiple instances of our entire setup of NATS + Resgates + Services, we may continue to scale our setup, and in addition gain benefits such as high-availability and redundancy. A multi-cluster setup could look like this:

Multi Clustered RES Network Diagram
Running two clusters of NATS + Resgates + Services synchronized over a message queue.

In the illustration above, two NATS servers run with three micro-services each, represented by the Java, Go [2], and Elixir icons. Each NATS server has two Resgate instances running.

Note

If needed, clusters may share services, as illustrated by the shared Elixir service, which listens to requests from both NATS servers.

Any changes that occur in one of the services should be synchronized with the same service in the other clusters. This can be done using a message queue or similar.

Because of the delay in synchronization, clusters might be slightly ahead of or behind each other for certain resources. When a client reconnects from a Resgate in one cluster to a Resgate in another cluster, it will not experience any trouble, as it is automatically resynchronized.
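The synchronization between clusters can be sketched as follows. This is a simplified illustration under stated assumptions: an in-memory `queue.Queue` stands in for a real message queue, and `ServiceReplica`, `change_locally`, and `drain` are hypothetical names rather than part of Resgate or any RES library.

```python
import queue


class ServiceReplica:
    """One instance of a service, running in one cluster."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = {}   # resource ID -> dict of values
        self.events = []  # events this replica would send within its own cluster

    def apply(self, resource_id: str, values: dict) -> None:
        """Update local state and record the event for the local cluster."""
        self.state.setdefault(resource_id, {}).update(values)
        self.events.append((resource_id, dict(values)))


def change_locally(replica: ServiceReplica, outbox: queue.Queue,
                   resource_id: str, values: dict) -> None:
    """Apply a change in the originating cluster and queue it for the others."""
    replica.apply(resource_id, values)
    outbox.put((resource_id, dict(values)))


def drain(inbox: queue.Queue, replica: ServiceReplica) -> int:
    """Apply all queued changes to a replica; return how many were applied."""
    applied = 0
    while True:
        try:
            resource_id, values = inbox.get_nowait()
        except queue.Empty:
            return applied
        replica.apply(resource_id, values)
        applied += 1


if __name__ == "__main__":
    a_to_b = queue.Queue()
    a, b = ServiceReplica("cluster-a"), ServiceReplica("cluster-b")
    change_locally(a, a_to_b, "example.model", {"message": "Hello"})
    print(b.state == a.state)  # False: cluster B is briefly behind
    drain(a_to_b, b)
    print(b.state == a.state)  # True: cluster B has caught up
```

Each replica sends events only within its own cluster, so the rule that events come from the service handling the resource still holds per cluster.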

Conclusion

Resgate and the RES protocol are built to be scalable. While scaling might not be as easy as simply cloning a running service instance, a Resgate setup is capable of growing to handle millions of concurrent users.

This document has only generally described the limitations and possibilities of scaling up your RES API. In reality, what your solution will look like depends entirely on your specific needs.


  1. RES Service Protocol Specification - Resource events

  2. The Go gopher was created by Takuya Ueda and designed by Renee French.