Monitoring message delay and missing down messages (ACK)

How do you monitor how long it takes to process a message for a single device, or whether ACK downlink messages are missing? We use confirmed uplinks, and if the delay from the sensor to the network server reaches the rx1_delay setting (1s), no ACK downlink is sent. It would be nice to have some numbers in Prometheus, e.g. a 99th percentile of receive window times or a record of slow devices. The goal is to see whether the margin to rx1_delay is too small and whether there is a risk that ACKs can no longer be sent if the network delay increases in the future.
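To make the idea concrete, here is a rough sketch of the kind of number we have in mind, computed offline from our own delay measurements. It is not an existing ChirpStack metric; the sample format `(dev_eui, delay_seconds)`, the function names and the 250 ms margin are assumptions for illustration only.

```python
import math
from collections import defaultdict

RX1_DELAY_S = 1.0  # rx1_delay setting mentioned above (1s)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slow_devices(samples, rx1_delay_s=RX1_DELAY_S, margin_s=0.25, pct=99):
    """Return {dev_eui: percentile_delay} for devices that are close to rx1_delay.

    samples: iterable of (dev_eui, delay_seconds) pairs (hypothetical format).
    A device is reported when its pct-th percentile delay leaves less than
    margin_s before rx1_delay is reached.
    """
    per_device = defaultdict(list)
    for dev_eui, delay in samples:
        per_device[dev_eui].append(delay)

    result = {}
    for dev_eui, delays in per_device.items():
        p = percentile(delays, pct)
        if rx1_delay_s - p < margin_s:
            result[dev_eui] = p
    return result
```

With something like this, a device whose 99th percentile delay is 0.9s would be flagged, while one at 0.3s would not.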
A possible small change would be to log when the network delay was too high and the ACK message could not be sent. Currently we only see missing ACK messages in the live dashboard (no errors occur; we can only see that after a confirmed uplink is received, no downlink is added in the same second), so automatic monitoring is currently not possible.

I believe these kinds of errors are already exposed? Have you checked the error event type? (Event types - ChirpStack open-source LoRaWAN® Network Server)

Thanks! We will check the events; currently we ignore all non-up events.
Should these errors be part of the statistical data, like the gateway transmission chart, or of the Prometheus data from the network server?

We have changed the topic subscription, but currently we only receive up events (e.g. chirpstack.34.device.[deveui].event.up). The gateway bridge is deployed in the cloud. If we add delay (>750ms) to the UDP packets from the LoRa gateway to the bridge, no error occurs and no ACK message is sent out as a reply.
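To double-check which event types actually arrive after changing the subscription, we count them by routing key. This is a minimal sketch that assumes all keys follow the same six-part layout as the example above (`chirpstack.<app id>.device.<deveui>.event.<type>`); the function names are ours. With AMQP topic exchanges, `*` matches exactly one word, so a binding such as `chirpstack.*.device.*.event.*` should cover every event type under that layout.

```python
from collections import Counter


def parse_routing_key(key):
    """Split a routing key of the assumed form
    chirpstack.<app id>.device.<deveui>.event.<type>.

    Returns a dict, or None if the key does not match that layout.
    """
    parts = key.split(".")
    if len(parts) != 6 or parts[2] != "device" or parts[4] != "event":
        return None
    return {"application_id": parts[1], "dev_eui": parts[3], "event": parts[5]}


def count_event_types(keys):
    """Count how often each event type occurs in a list of routing keys."""
    counts = Counter()
    for key in keys:
        parsed = parse_routing_key(key)
        if parsed is not None:
            counts[parsed["event"]] += 1
    return counts
```

If the resulting counter only ever contains `up`, no other event types are reaching the consumer.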

The code part is probably here:
Is it possible that the timeout is not caught and the ACK message is skipped as well? How can we debug this: should we add debug messages and build a new package ourselves? Are there any simpler approaches?
EDIT: In addition, I saw that all AMQP/integration messages are logged too. Our log backup contains only a few rare errors, which can also be seen in the gateway transmission chart.
EDIT2: Because we have many sensors with missing ACK messages, the log should be full of errors :wink:

More results:
It seems that the code block from above is never reached for our Class A sensors.
The gateway bridge (in the cloud) receives the downlink but does not send the downlink message back to the Redis stream. From the network server's perspective, everything is probably working fine.
We are now analyzing the bridge code to see in which situations the message is not added to the Redis device message stream. Error handling for this case is probably missing, because the bridge should send such errors back to the network server.

Sorry for spamming :wink:
We missed that there are ack MQTT messages. On the network server side, those ack messages are only printed as a success message: “backend/gateway: downlink tx acknowledgement received”

We have added debug logging at this place, and when the downlink ACK message is missing, the ack message contains “TOO_EARLY” in the error field. But it seems that this error is not processed any further.
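Until these errors are exposed somewhere, a simple tally over the ack messages from our debug output gives us a first number to watch. This is only a sketch: the acks are modelled as plain dicts, and the key names `gateway_id` and `error` are our own assumption, not the actual message schema.

```python
from collections import Counter


def tally_ack_errors(acks):
    """Count non-empty error values per (gateway, error) pair.

    acks: iterable of dicts with the hypothetical keys "gateway_id" and
    "error". An empty or missing "error" is treated as a successful ack.
    """
    counts = Counter()
    for ack in acks:
        error = ack.get("error") or ""
        if error:
            counts[(ack.get("gateway_id", "unknown"), error)] += 1
    return counts
```

A steadily growing count for “TOO_EARLY” on one gateway would then point to the timing problem described above.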

We tracked it down to

FPort is always nil in our case, so all messages will be ignored. Why is the FPort relevant for sending the error to the AS?

@brocaar should we raise a bug for this? Probably 1 or 2 separate issues.
fPort is always null, even when the downlink ACK message is sent out successfully. I found other fport=null examples from other users in the forum (e.g. Device sends ConfirmedUp but no UnconfirmedDown is initiated).
So the fPort check here and at are wrong and/or fPort must also be set elsewhere for ACK messages.

This code block checks if the downlink was an application payload (thus fPort > 0), which is not the case. In your case it is a mac-layer only downlink.
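As an illustration of that condition (a simplified sketch with made-up names, not the actual network server code), the check behaves roughly like this: a tx error is only passed on when the downlink carried an application payload, so an ACK-only downlink without an fPort is skipped.

```python
from typing import Optional


def is_application_downlink(f_port: Optional[int]) -> bool:
    """True only when the downlink carries an application payload (fPort > 0).

    In LoRaWAN, fPort 0 is reserved for MAC commands and a missing fPort
    means there is no payload at all.
    """
    return f_port is not None and f_port > 0


def forwards_tx_error(f_port: Optional[int], error: str) -> bool:
    """Mirror of the behaviour described in this thread (hypothetical helper):
    an error is forwarded only for application-payload downlinks."""
    return bool(error) and is_application_downlink(f_port)
```

For example, `forwards_tx_error(None, "TOO_EARLY")` evaluates to `False`, which matches what was observed for the ACK downlinks.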

To summarize the issue, I think this describes it:

In case a mac-layer downlink (for example an ACK of a confirmed uplink) cannot be scheduled by the gateway (e.g. too late / too early), these errors are not included in the statistics and logs.

Is that correct?


Thanks for clarifying! Yes, mac-layer downlink errors are missing (stats, logs, integration events).

I tried to fix the problem:
