ChirpStack 4: losing some uplinks during a burst MQTT gateway reconnection

When a gateway comes back after some internet problem, it sends a lot of MQTT up messages to our main broker, which retransmits the burst to our ChirpStack 4 server.

Everything is OK when the internet problem is short: out of 10 up messages, 10 are caught by ChirpStack.

But when the internet problem is longer (in my case, 1400 messages), only 400 are sent to my application.

I put an MQTT "scope" (with a different client ID) between our main MQTT broker and ChirpStack, and I saw 1400 messages (all the messages are sent within a few seconds).

I looked at the ChirpStack log and see only ~400 upstream events.
I looked at our Mosquitto broker log and saw that each of the 1400 publishes to ChirpStack is acknowledged.

It seems that ChirpStack drops some upstream events when dealing with a burst.

I don't see any error in the ChirpStack log (TRACE level).

Has anybody encountered this problem?
Maybe there is a setting to adjust (a Redis setting…) or something else.

The MQTT config in ChirpStack 4 is:

# MQTT configuration.
[regions.gateway.backend.mqtt]
qos=1
clean_session=false
client_id="chirpstack_regioneu_picstloup"


We have seen that some MQTT messages on the application side contain something like "Frame-counter reset or rollover detected".
My understanding is that up messages do not arrive in the correct order, so ChirpStack skips them.

So to test, I put the device in "Disable frame-counter validation" mode to avoid that.

My first test is a small one, but only 1 message out of 10 was missed.

I need to do a real test to see if the behavior is better.

But my opinion is that it does not completely solve the problem.

I suspect something else.

There is something else.
Here is my test.
A device is sending a frame every 60 seconds, in very good TX/RX conditions.

I cut the internet connection of my gateway from 14:52 to 16:00:
We should get 69 messages; actually we get only 45, so there is a loss of ~20 messages.

There are 72 incoming MQTT messages from the MQTT backend.

I think 72 is the sum of our caught MQTT frames (45) + 25 frames from external devices + 2 external join requests.

grep " Message received from gateway region_config_id" selected_cs_log | grep "event\/up" |  wc -l
72
grep -i warn selected_cs_log 
WARN up{deduplication_id=4f362755-e86a-4a6b-b323-a9940cf7e087}:data_up: chirpstack::uplink::data: No device-session exists for dev_addr dev_addr=1f89270e
WARN up{deduplication_id=5cc9c071-3e23-45d3-b835-625e6f45cb3d}:data_up: chirpstack::uplink::data: No device-session exists for dev_addr dev_addr=260b16ec
...
grep -i warn selected_cs_log | wc -l
25
grep "ERROR" selected_cs_log 
mars 15 15:00:48 picstloup chirpstack[3003054]: 2023-03-15T15:00:48.752325Z ERROR up{deduplication_id=faf57b99-cee2-48f7-b900-63e0ee5d5db7}: chirpstack::uplink::join: Handle join-request error error=Object does not exist (id: 100bc50c00007530)
mars 15 15:00:48 picstloup chirpstack[3003054]: 2023-03-15T15:00:48.776345Z ERROR up{deduplication_id=4c453466-4676-45d0-8496-ec08dc8f1f7c}: chirpstack::uplink::join: Handle join-request error error=Object does not exist (id: 100bc50c00007530)

grep "ERROR" selected_cs_log  | wc -l
2
 

We publish 45 messages to the MQTT integration channels (this is what I get in my application).

grep "chirpstack::integration::mqtt: Publishing event"  selected_cs_log |  wc -l
45

I suspected something around the deduplication, so I set the delay to 0 ms.

=> /etc/chirpstack/chirpstack.toml

# Network related configuration.
[network]
  # Network identifier (NetID, 3 bytes) encoded as HEX (e.g. 010203).
  net_id="000000"
  # Time to wait for uplink de-duplication.
  #
  # This is the time that ChirpStack will wait for other gateways to receive
  # the same uplink frame. Please note that this value affects the
  # roundtrip time. The total roundtrip time (which includes network latency)
  # must be less than the (first) receive-window.
  deduplication_delay="0ms"

But it has no influence…

And I can't find any other idea.

Maybe MQTT max inflight has some impact?

Thanks for the idea

I have set max inflight to only 1.
On the gateway's broker (configured as a bridge to our main MQTT broker) and on the main MQTT broker too.

The basic idea is to slow down the MQTT burst to ChirpStack.
Intermediate results seem to be better, but not perfect. I will wait for tonight's test.
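For reference, a minimal sketch of the Mosquitto option involved (assuming mosquitto.conf on both brokers; bridge-specific tuning may differ):

    # mosquitto.conf
    # Allow only one outgoing QoS 1/2 message in flight per client at a time.
    max_inflight_messages 1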

I see a "workaround" solution. It would be to add a specific MQTT "observer" forwarder between the main broker and ChirpStack. Its role would be to forward any message immediately, except messages received within a small time window. Those messages, probably a "burst", would be forwarded spaced by 200 ms, for example. A rough sketch is given below.
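A minimal sketch of what such a pacing forwarder could look like, assuming the Rust paho_mqtt crate (v0.12) with tokio and futures; the broker URLs and the topic filter are placeholders, not our real configuration:

    // Sketch of the pacing "observer" forwarder idea, not production code.
    use std::time::{Duration, Instant};

    use futures::StreamExt;
    use paho_mqtt as mqtt;

    #[tokio::main]
    async fn main() -> mqtt::Result<()> {
        // Client that receives the (possibly bursty) uplinks from the main broker.
        let mut upstream = mqtt::AsyncClient::new("tcp://main-broker.example:1883")?;
        // Client that republishes them towards the broker ChirpStack listens on.
        let downstream = mqtt::AsyncClient::new("tcp://chirpstack-broker.example:1883")?;

        // Unbounded message queue so nothing is dropped while we pace the output.
        let mut stream = upstream.get_stream(None);

        upstream.connect(mqtt::ConnectOptions::new()).await?;
        downstream.connect(mqtt::ConnectOptions::new()).await?;
        upstream.subscribe("application/ns/gateway/+/event/up", 1).await?;

        let pacing = Duration::from_millis(200);
        let mut last_forward = Instant::now();

        // A None item in the stream signals a lost connection; the sketch just stops.
        while let Some(Some(msg)) = stream.next().await {
            // If the previous forward happened less than 200 ms ago, we are probably
            // inside a burst: wait for the remainder of the window before forwarding.
            let elapsed = last_forward.elapsed();
            if elapsed < pacing {
                tokio::time::sleep(pacing - elapsed).await;
            }
            downstream
                .publish(mqtt::Message::new(msg.topic(), msg.payload(), 1))
                .await?;
            last_forward = Instant::now();
        }
        Ok(())
    }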

Another point: I think I don't catch big ChirpStack errors, like a stack overflow or other exceptions, in the syslog. Where can I find them?

Just an example: in my various experiments to solve this problem, I raised the Redis max connections setting to a big value:
max_open_connection=1000.

I restarted ChirpStack, and the log was OK. But when I tried to go to the web interface: nothing.
Apache told me:

] [proxy_http:error] [pid 3014346] (70007)The timeout specified has expired: [client XXX.XXX.98.156:50172]

So clearly ChirpStack was stalled because of this setting, but I couldn't get any information in the logs about the problem.

So I suppose that during an MQTT burst, a failure occurs and nothing is logged…

My ChirpStack is 4.3.0, my system is Debian 11.

I would investigate MQTT ordering, as it is likely off by default, and would need to be turned on in all servers and clients. Even then, client drivers may not guarantee maintaining ordering in an offline (QoS = 1) state. I believe I ran into this with the Gateway Bridge at one point.

You may also be seeing the effects of at-least-once delivery and not actually have 1400 unique messages.

The MQTT ordering issue is solved, from my point of view, by using "Disable frame-counter validation" and max inflight.

The overall aim of all this is to be able to receive stored data frames when the internet connection is broken. Some of our gateways are in difficult internet conditions.

The result of the overnight test is not too bad, but not perfect.

  • Set MQTT max inflight to only 1, on the local broker and the main broker (it clearly slows down the MQTT flow, which seems to be a good point for the ChirpStack server)
  • deduplication_delay="0ms" - override the default value (200ms)
  • Disable frame-counter validation to avoid rejected frames in case of out-of-order delivery

During the real overnight test, out of 960 expected frames we received 882, a failure rate of ~8%.

We have created a test bench replaying the same MQTT messages with a simple bash script, in a sequential manner and in a multi-process manner, to do some stats and experimentation.

Sequential:

mosquitto_pub -u agspXXXXXr -P "XXXXXX" -q 1 -h <ourmqttbroker> -t aXXXXXn/ns/gateway/0080000000024764/event/up -m '{"phyPayload":"QLJcTwEAnQ4BB6JNaDro/gv1x/pzPMD0qxQuBGv2cARlUf9GBg4=", "txInfo":{"frequency":867100000, "modulation":{"lora":{"bandwidth":125000, "spreadingFactor":9, "codeRate":"CR_4_5"}}}, "rxInfo":{"gatewayId":"0080000000024764", "uplinkId":44220, "time":"2023-03-16T16:21:54.545889070Z", "rssi":-69, "snr":12.8, "channel":3, "rfChain":1, "context":"l9uIZA==", "crcStatus":"CRC_OK"}}'

mosquitto_pub -u xxxxxx -P "xxxxx" -q 1 -h ourmqtt.server.fr -t aXXXXXn/ns/gateway/0080000000024764/event/up -m '{"phyPayload":"QLJcTwEAng4Bqni6j+50Ht0ecipLGoOyFQB6sF5HJ/EPVEQmLqw=", "txInfo":{"frequency":868300000, "modulation":{"lora":{"bandwidth":125000, "spreadingFactor":9, "codeRate":"CR_4_5"}}}, "rxInfo":{"gatewayId":"0080000000024764", "uplinkId":9568, "time":"2023-03-16T16:22:54.499896832Z", "rssi":-67, "snr":12.8, "channel":1, "context":"m25YLA==", "crcStatus":"CRC_OK"}}'

Multi-process:

mosquitto_pub -u agspXXXXXr -P "XXXXXX" -q 1 -h <ourmqttbroker> -t aXXXXXn/ns/gateway/0080000000024764/event/up -m '{"phyPayload":"QLJcTwEAnQ4BB6JNaDro/gv1x/pzPMD0qxQuBGv2cARlUf9GBg4=", "txInfo":{"frequency":867100000, "modulation":{"lora":{"bandwidth":125000, "spreadingFactor":9, "codeRate":"CR_4_5"}}}, "rxInfo":{"gatewayId":"0080000000024764", "uplinkId":44220, "time":"2023-03-16T16:21:54.545889070Z", "rssi":-69, "snr":12.8, "channel":3, "rfChain":1, "context":"l9uIZA==", "crcStatus":"CRC_OK"}}' &

mosquitto_pub -u xxxxxx -P "xxxxx" -q 1 -h ourmqtt.server.fr -t aXXXXXn/ns/gateway/0080000000024764/event/up -m '{"phyPayload":"QLJcTwEAng4Bqni6j+50Ht0ecipLGoOyFQB6sF5HJ/EPVEQmLqw=", "txInfo":{"frequency":868300000, "modulation":{"lora":{"bandwidth":125000, "spreadingFactor":9, "codeRate":"CR_4_5"}}}, "rxInfo":{"gatewayId":"0080000000024764", "uplinkId":9568, "time":"2023-03-16T16:22:54.499896832Z", "rssi":-67, "snr":12.8, "channel":1, "context":"m25YLA==", "crcStatus":"CRC_OK"}}' &

The results are:

  • Sequential: 0% failure
  • Multi-process: 66% failure

Note that the Redis max connections setting has no influence at all on the failure results in multi-process mode… We keep 10.

Maybe an item to investigate?
The paho-mqtt client input stream buffer size is set to 25.
Is it possible that this has an impact and that some messages are lost because the paho input stream is full?
In that case, it seems that the message callback would not be performed even though the message has been acknowledged?

    // get message stream
    let mut stream = client.get_stream(25);

We will try increasing this value to 1000 in a self-compiled ChirpStack version.
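The change is only the buffer size passed to get_stream in "gateway/backend/mqtt.rs":

    // get message stream with a larger buffer (was 25)
    let mut stream = client.get_stream(1000);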

It is better

  • client.get_stream(1000) in a self-compiled ChirpStack version
  • MQTT max inflight set to only 1, on the local broker and the main broker (it clearly slows down the MQTT flow, which seems to be a good point for the ChirpStack server)
  • deduplication_delay="0ms" - override the default value (200ms)
  • Disable frame-counter validation to avoid rejected frames in case of out-of-order delivery

On a big burst of ~1500 messages during the weekend, we retrieved ~1400 messages: something like 7-8% failure.

Maybe it could become a parameter… even though we haven't managed to detect the failure in the logs.


The update of paho_mqtt to v0.12.1 introduced an unbounded get_stream option:

get_stream(None)

Recompiling ChirpStack and changing the code in "gateway/backend/mqtt.rs" to:

// get message stream
let mut stream = client.get_stream(None);

Then we performed a burst test with 1150 messages in a few seconds: it is now OK and no message is lost.


@pnoon Awesome.
What does this line of code mean?

get_stream(None)

Hi Datnus,
As my explanations can never be as good as those of the paho_mqtt documentation for Rust, I prefer to redirect you to the docs:
paho_mqtt get_stream
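In short, the value passed to get_stream() is the capacity of the internal queue where incoming messages wait until your loop reads them; with a bounded queue a fast burst can overflow it, while None makes it unbounded. A minimal sketch (assuming paho_mqtt 0.12; the broker URL and topic are placeholders):

    use futures::StreamExt;
    use paho_mqtt as mqtt;

    #[tokio::main]
    async fn main() -> mqtt::Result<()> {
        let mut client = mqtt::AsyncClient::new("tcp://broker.example:1883")?;

        // Bounded: at most 25 incoming messages can wait in the queue.
        // let mut stream = client.get_stream(25);

        // Unbounded: every incoming message is queued until the consumer reads it.
        let mut stream = client.get_stream(None);

        client.connect(mqtt::ConnectOptions::new()).await?;
        client.subscribe("ns/gateway/+/event/up", 1).await?;

        while let Some(msg_opt) = stream.next().await {
            match msg_opt {
                Some(msg) => println!("{}: {} bytes", msg.topic(), msg.payload().len()),
                None => println!("connection lost"), // None signals a lost connection
            }
        }
        Ok(())
    }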


@pnoon Thanks a lot. :smiley:

Just to conclude on our choices: we succeeded in getting good behavior.
On 1300 frames => 0% failure during MQTT bursts.

  • Self-compiled ChirpStack with paho_mqtt v0.12.1, and let mut stream = client.get_stream(None); in "gateway/backend/mqtt.rs"
  • MQTT max inflight set to only 1 in our brokers, to go a bit easier on the MQTT stack with qos=1
  • deduplication_delay="0ms" instead of 200ms (the default value). With the default value, ChirpStack goes into a strange state when an MQTT burst occurs: it looks stuck, for example the web UI does not answer (timeout), and there is nothing strange in the logs.
  • Disable frame-counter validation - normally, with max inflight set to 1, the order is preserved, but… we prefer to do it anyway.

The Redis max connections setting has no impact on this subject.

