Hello,
we operate a small network. Everything runs on Kubernetes in Google Cloud, and I am now trying to switch to an HA setup. ChirpStack and Redis already run in HA. Mosquitto, our current MQTT broker, does not support HA and will be replaced by something else in the next phase.
I am currently testing the transition from UDP to Basic Station in the Gateway Bridge and scaling it to multiple pods (instances) in Kubernetes.
In the configuration, I have set up a shared subscription as follows:
command_topic_template="$share/gwbridge/eu868/gateway/{{ .GatewayID }}/command/#"
With 3 pods, it works. However, when I increased the number of pods to 10, outages and device rejoins started. In most cases, the join accept is not delivered to the device, and everything gradually disconnects.
The forum usually recommends using Gateway OS or the MQTT Forwarder on the gateway for HA mode. We use Mikrotik devices in our network, where this is not possible.
Does anyone have any recommendations on how to adjust the settings further from the defaults? Is my approach correct? If the current solution is not powerful enough and I cannot scale the number of pods, I would have to create a new Gateway Bridge.
We chose Mikrotik because of the price, reliability and easy configuration of their devices. I am also testing the SenseCap M2 with Gateway OS, but it did not convince me that it would be better; for our use it is worse. So we would like to continue using Mikrotik.
Liam_Philipp, thank you very much, that was a very good point.
The Google load balancer does not have this feature enabled by default. It is necessary to set “SessionAffinity” to “ClientIP”.
I’m testing this setting now. It looks like the problem has been solved; I’ll let you know the result.
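For anyone who wants to try the same setting, a minimal sketch of a Service manifest with client-IP affinity could look like the one below. The Service name, the `app` label and port 3001 (assumed here to be the port the Basic Station backend listens on) are illustrative assumptions and need to match your own deployment. `sessionAffinity: ClientIP` is the standard Kubernetes Service field.

```yaml
# Sketch only: name, label and port are assumptions, not taken from this thread.
apiVersion: v1
kind: Service
metadata:
  name: chirpstack-gateway-bridge    # hypothetical name
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP          # route each client IP to the same pod
  selector:
    app: chirpstack-gateway-bridge   # hypothetical pod label
  ports:
    - name: basicstation
      protocol: TCP
      port: 3001                     # assumed Basic Station listen port
      targetPort: 3001
```

With this in place, a gateway that reconnects from the same source IP should land on the same pod while that pod is available.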
Increasing the number of pods works. However, a similar problem occurred when reducing the number of pods. This was solved by enabling container-native load balancing. If anyone else is interested, there is more information here: Container-native load balancing | GKE networking | Google Cloud
The annotation for the setting is: cloud.google.com/neg: '{"ingress": false}'
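In a manifest, the annotation goes under the Service's `metadata.annotations`. The fragment below is only a sketch of the placement: the Service name is hypothetical, and the annotation value is copied exactly as posted above, so it is worth checking against the GKE documentation for your load balancer type.

```yaml
# Fragment of a Service manifest; only the metadata section is shown.
metadata:
  name: chirpstack-gateway-bridge    # hypothetical name
  annotations:
    cloud.google.com/neg: '{"ingress": false}'   # value as posted in this thread
```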
Two strange situations occurred during testing (probably a bug on the Mikrotik side). In the first case, the LoRa interface on the Mikrotik was turned off. In the second case, it even disappeared from the interface list. It is a marginal, random problem that I will handle with SNMP monitoring.
So now I think the HA setup should work properly. Thanks for the guidance.
I’m still struggling to get a stable setup on GKE. The system is only stable when I scale the chirpstack-gateway-bridge down to a few replicas. As soon as I increase the number of replicas, problems appear after a few hours.
Devices send a JoinRequest, but the JoinAccept is often not received. When a device does manage to join, subsequent confirmed uplinks often fail to receive an ACK, forcing the device into a re-join loop.
I understand that a shared subscription on the command topic will not work, because it distributes downlinks randomly and not to the specific pod that holds the WebSocket connection for that gateway.
The question is: is it even possible to run the Gateway Bridge (for Basic Station) in HA mode with multiple replicas? Has anyone successfully deployed it this way? I want to understand whether I should look for the problem in the GCP infrastructure (load balancer settings), or whether my entire assumption of how it works is wrong.