ChirpStack itself. I’m actually bringing up the HA discussion on a lot of these topics because we’re looking to scale up. I can confirm it works with Cisco CPF.
It’s a Cisco IXM
Just to say, though: once we got it working, another team took it over, so over the last 4 years the knowledge has slowly leaked away.
Anyway. The GWB does not do much, as it is just an adaptor. You could attempt to scale the GWB, but I don't think a single instance would really be your limit. I would scale ChirpStack itself instead: the more threads you have for processing incoming data, the better the parallelism.
In fact we already have 3 pods for the application server, network server and Mosquitto. However, we've got some orders that mean we'll be scaling rather aggressively this year, so I want to be sure it's prepared to handle as much load as possible.
I was wondering what kind of load the GWB can handle reliably before we need to consider scaling?
Perhaps SP has some general numbers for you, but if you have a testing server you could use and the time to set it up, you could try a LoRaWAN simulator to do some proper stress testing with simulated gateways and devices.
Hmmm, that does remind me that I had the ChirpStack Simulator set up for the initial POC. I'll hunt down that code and see if it still works. It could give me the data I need to prove the case one way or the other. Thanks for the suggestion!
I haven’t personally dealt with several thousand active devices. In the cases we did handle, the LNS operation was outsourced.
But I have some idea. LoRaWAN devices don’t send data very often, so you can make some assumptions.
How do you define “handle reliably”? I believe it depends on the project’s requirements. If you do not expect downlinks to be sent in response to uplinks, it is much simpler. LoRaWAN is weaker at downlinks, as the SX1301 is half-duplex and can only modulate one frame at a time.
GWMP involves a few types of inbound messages. You can approximate the traffic:
PULL_DATA: heartbeat from the gateway (though not what ChirpStack uses to determine online/offline status). Sent once every few seconds by every gateway, which is not a lot of traffic.
PUSH_DATA: uplink data from the gateway. A single datagram may carry multiple uplinks, but assume the worst case where each carries only one.
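Concretely, those two datagrams look roughly like this on the wire. This is a minimal sketch from memory of the Semtech UDP packet-forwarder protocol, not from ChirpStack's code; the host, port, gateway EUI, and the sample rxpk values are all placeholders:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Sketch of the two inbound GWMP (Semtech UDP) datagrams discussed above.
public class GwmpSketch {
    static final byte PROTOCOL_VERSION = 0x02;
    static final byte PUSH_DATA = 0x00;
    static final byte PULL_DATA = 0x02;

    // Every GWMP message starts with a 12-byte header:
    // version (1) + random token (2) + identifier (1) + gateway EUI (8).
    static byte[] header(byte identifier, long gatewayEui, Random rnd) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        buf.put(PROTOCOL_VERSION);
        buf.putShort((short) rnd.nextInt(0x10000)); // random token, echoed in the ACK
        buf.put(identifier);
        buf.putLong(gatewayEui);
        return buf.array();
    }

    public static void main(String[] args) throws Exception {
        Random rnd = new Random();
        long eui = 0xAA555A0000000001L;                        // placeholder EUI
        InetAddress gwb = InetAddress.getByName("localhost");  // assumed GWB host
        int port = 1700;                                       // usual default UDP port

        try (DatagramSocket socket = new DatagramSocket()) {
            // PULL_DATA: the 12-byte header only, the keepalive sent every few seconds.
            byte[] pull = header(PULL_DATA, eui, rnd);
            socket.send(new DatagramPacket(pull, pull.length, gwb, port));

            // PUSH_DATA: header + JSON body carrying one uplink ("rxpk").
            String json = "{\"rxpk\":[{\"tmst\":1000000,\"freq\":868.1,"
                    + "\"chan\":0,\"rfch\":0,\"stat\":1,\"modu\":\"LORA\","
                    + "\"datr\":\"SF7BW125\",\"codr\":\"4/5\",\"rssi\":-60,"
                    + "\"lsnr\":7.5,\"size\":4,\"data\":\"AQIDBA==\"}]}";
            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            byte[] head = header(PUSH_DATA, eui, rnd);
            byte[] push = new byte[head.length + body.length];
            System.arraycopy(head, 0, push, 0, head.length);
            System.arraycopy(body, 0, push, head.length, body.length);
            socket.send(new DatagramPacket(push, push.length, gwb, port));
        }
    }
}
```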
So if you have 1 million nodes that send data every 30 minutes:
worst case: the devices are badly behaved and do not stagger, so you get 1 million datagrams in one instant, and then silence until the next interval. You would likely also have collision problems on your uplink channels.
best case: the devices stagger evenly, giving 1,000,000 / 1,800 s ≈ 556 msg/s.
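The arithmetic is trivial to script if you want to plug in your own fleet size and interval:

```java
// Back-of-the-envelope uplink rate for a given fleet size and send interval.
public class UplinkRate {
    public static void main(String[] args) {
        long devices = 1_000_000;       // fleet size from the example above
        long intervalSeconds = 30 * 60; // one uplink per device every 30 minutes

        System.out.printf("best case (evenly staggered): ~%.0f msg/s%n",
                (double) devices / intervalSeconds);
        System.out.printf("worst case (no stagger): %d datagrams in one burst%n",
                devices);
    }
}
```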
Either way, in the best case a modern PC with a 3 GHz clock should be more than able to handle it. I developed the MQTT component of my product (unrelated to ChirpStack or LoRa) and found that 30,000 unconfirmed messages/s was possible if the network can keep up. And MQTT runs atop TCP, which is heavier than UDP datagrams. I program in Java.
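That wasn't a ChirpStack test, but for reference, a micro-benchmark along those lines might look like this. A minimal sketch, assuming the Eclipse Paho mqttv3 client; the broker URL, topic, payload size, and message count are placeholders:

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;

// Rough MQTT publish micro-benchmark. QoS 0 is the "unconfirmed" case:
// fire-and-forget, so throughput is bounded mainly by the network and the
// broker, not by acknowledgement round-trips.
public class MqttThroughput {
    public static void main(String[] args) throws Exception {
        MqttClient client = new MqttClient("tcp://localhost:1883", "bench-1");
        MqttConnectOptions opts = new MqttConnectOptions();
        opts.setCleanSession(true);
        client.connect(opts);

        byte[] payload = new byte[32];   // small, LoRaWAN-sized payload
        int n = 100_000;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            client.publish("bench/topic", payload, 0, false); // QoS 0, not retained
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d msg in %.2f s -> %.0f msg/s%n", n, seconds, n / seconds);

        client.disconnect();
        client.close();
    }
}
```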
If you somehow do receive 1 million datagrams at once in the worst case, you could see datagrams dropped for lack of buffer space unless you tune the Linux kernel settings. And even if the system can accept all the frames and clear them fast enough, the next problem may be whether it can still meet the RX1 window timing, given the deeper parts of ChirpStack and the RTT across all the links.
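If you do test the burst case, the socket receive buffer is the first knob to check. A quick way to see what the kernel actually grants (on Linux the request is silently capped at net.core.rmem_max); the port here is just the usual default:

```java
import java.net.DatagramSocket;

// Check how much UDP receive buffer the kernel actually grants. If the
// printed value is far below the request, net.core.rmem_max needs raising
// before any burst test will be meaningful.
public class UdpBufferCheck {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket(1700)) {
            socket.setReceiveBufferSize(8 * 1024 * 1024); // ask for 8 MiB
            System.out.println("granted: " + socket.getReceiveBufferSize() + " bytes");
        }
    }
}
```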
Thank you for such a detailed dive into what you did previously. We’re looking at ~150k devices flowing data this year. But even if that number grows, given the nature of the work we’re involved in and physical space limitations, I think it’s impossible for them all to sync up and push at the same time. It’s really great to have these kinds of numbers to benchmark against in theory, though.
Just out of curiosity, if we were to vertically scale the GWB’s resources, do you have any usage statistics? We’re deploying into Kubernetes, so I’d love a ballpark figure for the CPU/RAM it consumed; we can request more, but need to be able to justify it.
Also, wondering how many gateways you had in your situation?
Honestly, I never looked at the load statistics. After the deployment was done in the project mentioned above, I stopped having access to the backend. Today, I see that there are 106 online gateways with 1,019 nodes.
My own management has indicated that we won’t take up the role of a LoRaWAN carrier, so I don’t normally have deployments as large as this one. This one is for an island-wide LoRaWAN network. We have a very dense urban area, which is complicated for LoRaWAN deployments.