I was hoping for a little clarification around running multiple gateway bridges, behind a load balancer in our case, just so I can understand things better.
Unfortunately at present, it isn’t feasible for us to install the gateway bridge on the gateways.
We are also still running v3 of everything, hoping we’ll be able to upgrade to v4 in the future, but again unfortunately that’s not feasible in the short term for a number of reasons I won’t bore you with.
Therefore I wanted to run at least two instances of the GWB, behind a load balancer to provide some fault tolerance should one of them go down so we don’t lose any incoming data.
I have noted the section in the documentation that states they can be connected to the same MQTT broker, but if they are behind a load balancer, then you must make sure that each gateway connection is always routed to the same instance. I can do this based on source IP so should be good on that front.
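For anyone wanting to picture the source-IP routing, here is a minimal sketch of the idea in Python. It is not how any particular load balancer implements it, and the backend addresses and the `pick_backend` helper are made up for illustration.

```python
import hashlib

# Hypothetical addresses for the two GWB instances (not from my real setup).
BACKENDS = ["10.0.0.11:1700", "10.0.0.12:1700"]


def pick_backend(source_ip, healthy):
    """Deterministically map a gateway's source IP to one healthy backend."""
    if not healthy:
        raise RuntimeError("no healthy gateway bridge instance")
    # Hash the IP so the same gateway always lands on the same list index.
    digest = hashlib.sha256(source_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(healthy)
    return healthy[index]
```

With both instances healthy, a given gateway IP always maps to the same instance. If one drops out of the `healthy` list, its gateways move to the survivor, and they move back once it returns.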
I have read many other forum posts on this topic and it was mentioned the reason for this was that the GWB stored the state of the gateway. This is where I was hoping to get a clearer understanding of how that bit works and what effect restarting an instance, or replacing it entirely, would have on things.
I’m not clear on how the state is stored, I thought I read one forum post saying that it was via a retained MQTT message but I can’t seem to find that post again.
So the queries/concerns I have and would be very grateful for some help in clarifying are:
It would be handy to understand why the gateways must route to a specific instance, when both are connected to the same broker. I understand it’s around the gateway state, but I am curious to know a bit more detail around it and why the routing matters to get a clearer picture of things in my head.
How is the gateway state stored? I haven’t seen any issues arising from replacing instances running a GWB: the gateways are still able to send data even though the relaunched instance is effectively a completely fresh instance without any previous state.
If I have 2 instances each running a GWB, with gateways routed to one specific instance, and that instance fails and I fail over to the other, what happens with the state, and will this potentially cause me issues? I have run some tests around this (although only using a LoRaWAN device simulator, not a real one) and it all seemed to carry on working fine, but I am concerned I am missing something here.
I believe all of the “gateway state” information is stored in ChirpStack’s Postgres / Redis databases, so restarting a GWB should have no effect on communication; similarly, swapping to a new one shouldn’t have an effect.
I believe the gateways routing to a specific instance is primarily because of connection state management. Each instance of the GWB maintains its own WebSocket or UDP connection to a gateway. If you switch to a different instance without preserving this session, the new GWB instance would have to re-establish the connection, which could lead to missed packets or delays.
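To illustrate what I mean, here is a toy Python model of the UDP case. It is not the GWB’s actual code. It assumes, as in the Semtech UDP protocol, that the gateway’s periodic PULL_DATA keepalive is what tells the server side where to send downlinks. The `ToyBridge` class and its method names are hypothetical.

```python
class ToyBridge:
    """Toy stand-in for one GWB instance; not the real implementation."""

    def __init__(self, name):
        self.name = name
        self.gateways = {}  # gateway_id -> (ip, port), held in memory only

    def on_pull_data(self, gateway_id, addr):
        # The keepalive tells this instance where the gateway can be reached.
        self.gateways[gateway_id] = addr

    def downlink_target(self, gateway_id):
        # Returns None if this instance has never heard from the gateway.
        return self.gateways.get(gateway_id)
```

In this model, an instance that has not yet received a keepalive from a gateway has nowhere to send that gateway’s downlink. That is the window you would expect to see right after a switch.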
Disclaimer: I’ve never used V3 but this is my understanding in V4.
Thanks for taking the time to respond, it’s appreciated.
I spent some more time testing this and tried quite hard to break things. I flicked traffic between instances at various points and was only able to cause problems infrequently, although I was only sending from 1 simulated device every 45 seconds.
Most of the time, I switched traffic over and everything just kept working.
In all cases, I still received the uplink frames.
I did encounter occasional issues with the downlink, which makes sense based on what you’ve said.
The scenarios in which I encountered issues were:
Forcing a failover just as the uplink frame came in, going by the logs, before the downlink acknowledgement was sent. This resulted in a “no internal frame cache for token”, which makes sense.
Forcing a failover around 10 seconds or less before the uplink frame came in. This resulted in the downlink not being received, nor did it seem to work on a re-transmission. I think there just wasn’t enough time for the Gateway (or GWB, not sure which way around this bit works) to work out that the connection between the two wasn’t there any more, re-establish the session, and for the now active GWB to subscribe to the relevant MQTT topics. Not entirely sure why the re-transmissions still failed though.
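For what it’s worth, this is how I picture the first scenario, as a toy Python sketch. It assumes the bridge keeps each pending downlink in memory, keyed by its token, until the gateway acknowledges it. That is my reading of the log message, not something I have confirmed in the GWB source. `ToyAckCache` and its methods are made-up names.

```python
class ToyAckCache:
    """Toy model of one GWB instance's pending-downlink cache."""

    def __init__(self):
        self.pending = {}  # token -> downlink frame, held in memory only

    def send_downlink(self, token, frame):
        # Remember the frame so the gateway's acknowledgement can be matched.
        self.pending[token] = frame

    def on_tx_ack(self, token):
        frame = self.pending.pop(token, None)
        if frame is None:
            # The ack arrived at an instance that never sent this downlink.
            return f"no internal frame cache for token {token}"
        return f"ack matched for token {token}"
```

If the downlink leaves via one instance and the acknowledgement lands on the other after a failover, the second instance has nothing stored under that token, which would fit the error I saw.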
So it seems there would be a short period of potential issues with downlinks during a failover, however uplinks should still be received.
I think this is acceptable for our use case and certainly better than having only a single instance which, if it died, would result in nothing being received at all.
What I haven’t tested is what would happen if there were downlinks queued that were unrelated to the uplinks/queued before the uplink and a failover occurred. I suspect more issues may occur in that scenario.
My gut tells me this shouldn’t be an issue. How it works in CS gateway OS, and I would assume all gateways, is that when you queue a downlink it gets passed all the way to the gateway’s concentrator to be queued (not just held in the GWB until the uplink arrives), which means even if you have a failover with your GWB the downlink is already queued inside the gateway itself.
Interesting testing though overall, thanks for sharing.