LoRa Meters Offline after 180 Days – Possible Configuration or Certificate Issue

We are using LoRa devices, specifically DZG electricity meters, which are connected via a Kerlink PDTIOT-IFE00 iFemtocell gateway to a ChirpStack v4 server. However, we have recently encountered an issue: some of the approximately 30 connected meters have stopped transmitting data and are now showing as offline. Notably, this issue occurred exactly 180 days after the devices were first commissioned.

What is particularly striking is that most of the meters stopped transmitting data at precisely the same time. A few other meters exhibited the same behavior exactly 24 hours later. Since then, these devices have remained offline and are no longer sending any data.

Our initial suspicion points towards a time-dependent configuration, such as expiring settings or certificates. Could anyone assist with this issue?

We are unsure if this is related to the configuration, the firmware, a central setting in the gateway, or the server. Is there anyone with experience in similar problems or with suggestions for potential solutions? It would be especially helpful to get guidance on how to analyze or test this behavior to resolve the issue and prevent it from recurring in the future.

Share the ChirpStack logs. If the uplinks are making it to ChirpStack, we should be able to spot the problem there. If not, it is likely a gateway-related issue.

Do you use certificates in your deployment?

Thank you for your response! We do indeed use TLS certificates for the gateways. Could you please specify which logs you require? We would be happy to provide them accordingly.

Thank you in advance for your support!

The ChirpStack logs themselves. How you get these depends on how you installed it.

If it was docker you can use:
docker compose logs -f

If Ubuntu/debian you can use:
sudo journalctl -f -n 100 -u chirpstack

Have you checked your certificate expiry?

And are your gateway bridges installed on gateway or on server?

It would make sense if the certificates were on the gateways (with the gateway bridge installed there) and those expired, causing different sets of devices to disconnect at different times.
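If you haven't already, you can check the expiry dates directly on the gateway with something along these lines (the file names are placeholders for wherever your certificate files live):
openssl x509 -noout -enddate -in client-cert.pem
openssl x509 -noout -enddate -in ca-cert.pem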

Hi, thanks for your help.

I’m working together with @dimatus.

Our Chirpstack is running in docker. The thing is that we have multiple other gateways which are running correctly (for now). So we have a lot of logs from every gateway.

To answer your other questions:

The certificates are valid for longer than 180 days. And our gateways haven't even been in use for that long.

We installed the ChirpStack MQTT Forwarder on our gateways.

Yes, our certificates are installed on the gateways.

Are you getting any errors in the ChirpStack logs? You can grep the log output for errors or warnings. Ideally, if you can do so without disrupting production, restarting ChirpStack and monitoring the logs often helps surface connection errors. But seeing as your other gateways are working correctly, the issue is probably not with ChirpStack but with the gateways/devices themselves.
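For example, assuming the default docker compose service names from the ChirpStack example setup (adjust if yours differ):
docker compose logs chirpstack | grep -iE "error|warn"
docker compose logs mosquitto | grep -iE "error|warn"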

Are any gateways showing offline? Or is it only the devices themselves that stopped communicating?

If gateways are showing offline, does the iFemtocell have logging of its own? Checking the logs of the offending gateways would be the way to go if so.

Really the goal is to find where in the chain of device → gateway → MQTT broker → Chirpstack the uplinks are getting lost.
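One way to narrow that down is to subscribe to the broker yourself and watch whether the uplinks arrive there. A sketch, assuming the default eu868 topic prefix and that your broker expects the same client certificates as the gateways (host, prefix and file names are placeholders):
mosquitto_sub -h <MQTT-BROKER-IP> -p 8883 --cafile <ca-cert.pem> --cert <client-cert.pem> --key <client-key.pem> -t 'eu868/gateway/+/event/up' -v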

Our gateways are in production, so we had to swap out the old gateway. The old gateway was still showing as online. We can still see some join requests from yesterday, before we switched the gateways.

When I grep for errors (they are a few days old) I get the following:

mosquitto-1  | 2024-11-26T02:24:28.859747078Z 1732587868: OpenSSL Error[0]: error:140360C7:SSL routines:ACCEPT_SR_CERT:peer did not return a certificate
mosquitto-1  | 2024-11-26T02:24:28.859777888Z 1732587868: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:49.762858450Z 1732680289: OpenSSL Error[0]: error:1402542E:SSL routines:ACCEPT_SR_CLNT_HELLO:tlsv1 alert protocol version
mosquitto-1  | 2024-11-27T04:04:49.762906014Z 1732680289: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:50.148975951Z 1732680290: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:50.149006193Z 1732680290: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:50.497348689Z 1732680290: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:50.497371484Z 1732680290: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:50.835645032Z 1732680290: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:50.835676960Z 1732680290: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:51.195628450Z 1732680291: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:51.195660476Z 1732680291: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:51.544808759Z 1732680291: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:51.544842652Z 1732680291: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:51.887321597Z 1732680291: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:51.887387975Z 1732680291: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:52.241729918Z 1732680292: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:52.241760948Z 1732680292: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:52.608221827Z 1732680292: OpenSSL Error[0]: error:14FFF0C7:SSL routines:(UNKNOWN)SSL_internal:peer did not return a certificate
mosquitto-1  | 2024-11-27T04:04:52.608255085Z 1732680292: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T04:04:52.899338821Z 1732680292: Client connection from 142.93.248.86 failed: error:1402542E:SSL routines:ACCEPT_SR_CLNT_HELLO:tlsv1 alert protocol version.
mosquitto-1  | 2024-11-27T04:04:53.078598788Z 1732680293: Client connection from 142.93.248.86 failed: error:1402542E:SSL routines:ACCEPT_SR_CLNT_HELLO:tlsv1 alert protocol version.
mosquitto-1  | 2024-11-27T09:05:52.378995294Z 1732698352: OpenSSL Error[0]: error:02FFF068:system library:func(4095):Connection reset by peer
mosquitto-1  | 2024-11-27T09:05:52.380516659Z 1732698352: Client auto-69AC374B-0ED4-C0AE-DB76-E2F380D09A70 disconnected: Protocol error.
mosquitto-1  | 2024-11-27T11:24:16.860251962Z 1732706656: OpenSSL Error[0]: error:140360C7:SSL routines:ACCEPT_SR_CERT:peer did not return a certificate
mosquitto-1  | 2024-11-27T11:24:16.860294514Z 1732706656: Client <unknown> disconnected: Protocol error.
mosquitto-1  | 2024-11-27T11:27:57.351338428Z 1732706877: Client connection from 13.58.97.162 failed: error:1402542E:SSL routines:ACCEPT_SR_CLNT_HELLO:tlsv1 alert protocol version.

I don't have the gateway with me right now, so I have to wait for it to check its logs.

When you say the "old gateway", do you mean the gateway itself had the issues, and now that you have replaced it the devices are uplinking fine again? It's interesting that the gateway was still showing as online, as that means the MQTT forwarder was still sending stats events to the broker, which would suggest it is not that part of the chain failing. On the other hand, the fact that replacing the gateway fixed the issue, together with the SSL errors, does suggest the gateway was the problem.

Join requests in general, from that gateway, or from the failed devices?

The errors you shared do imply a certificate issue, although it's still not clear what is actually failing to connect. Do the IPs 142.93.248.86 and 13.58.97.162 mean anything to you? Also, did the errors persist until you swapped gateways, or was it just this small burst?

When you do get your hands on the gateway again, try openssl s_client -connect <MQTT-BROKER-IP>:8883 -cert <client-cert.pem> -key <client-key.pem> -CAfile <ca-cert.pem> with the certificates on the gateway. Then you will be able to confirm whether or not it was the certificates that caused the issue.
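If the handshake fails there, you can also verify the client certificate chain against the CA on the gateway, for example:
openssl verify -CAfile <ca-cert.pem> <client-cert.pem>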

Yes, with the new gateway everything looks to be working again.

From the failed devices.

I have to check where these IPs are from.
Good point, I will check the MQTT connection with the gateway certificates when I have it.

Very strange.

I don’t know what to make of this.

The fact that you could still see the join-requests coming in from the failed devices (and presumably without errors) means the uplink flow was still working, and at the very least the gateway was working to communicate that (assuming no other gateway was close enough to receive the join-requests).

Can you see any of ChirpStack's join-accept responses? Considering the join-requests made it through okay, I'd next look at whether the join-accepts made it back to the devices, if you really want to get to the source of why communication stopped.
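The same mosquitto_sub approach as before works for the downlink direction: subscribing to the gateway command topic while a device tries to join should show whether the join-accept is actually published to the broker (again assuming the eu868 prefix, placeholders as before):
mosquitto_sub -h <MQTT-BROKER-IP> -p 8883 --cafile <ca-cert.pem> --cert <client-cert.pem> --key <client-key.pem> -t 'eu868/gateway/+/command/down' -v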

Sorry for the wait.

So I had a look into the gateway logs and searched for errors and anything unusual, but I couldn't find anything that could be the problem.
I also checked the SSL connection with the certificates from the gateway, but that wasn't the problem either.

There are a few things I saw in the ChirpStack UI:
Some of our devices show OTAA errors. But these don't always happen on the same day as the loss of uplinks, and they have already been happening for a few months, so I would say this is a different problem.

E.g. this is one device over one year:

And here is the same device over the last 31 days:

There are JoinAccepts from ChirpStack.
But sometimes the devices seem to get stuck in a join loop:

Also, there are multiple JoinRequests/JoinAccepts and UnconfirmedDataUp/UnconfirmedDataDown messages back to back on a few devices.
Could this have something to do with our QoS setting?

Honestly, it could simply have been an issue with the gateway's transmit capabilities.

In my experience the issue of join-looping like that is typically because the frequency bands are mis-aligned, or the device simply isn’t receiving the join-accepts for whatever reason.

Seeing as the rest of your setup is fine and the devices were previously working, it's not a band mismatch. Also, the communication between ChirpStack and the gateway was working consistently for uplinks, so the messages were not getting lost on the IP side. The devices are working with the new gateway, so it's not a device issue. The most likely answer to me is that the gateway was failing to transmit.

Perhaps it could be a QoS issue, but then I'd expect to see a bunch of disconnections in the gateway bridge logs, and you'd think at least the occasional message would get through rather than all devices just going offline.

I guess it has to be a problem with the join-looping.
E.g. right now we have a different device that fails to join, although ChirpStack shows JoinAccepts.


What QOS would you recommend?

I'm honestly not sure; I've never played with those settings before, and changing them beyond zero might have side effects of its own.

QoS 1 - guarantees the message is received at least once, but has a chance of creating duplicates.

QoS 2 - guarantees the message is received exactly once, but introduces latency, which could further affect downlinks.

You could try QoS 2 and see if it solves the join-looping problem, but I doubt it. Considering there are 10+ joins in a row, this isn't simply the occasional packet getting lost in the MQTT pipe; in that case you'd only see the double join once, at most twice.

I'd find the gateway that device is uplinking through and verify that its frequencies and MQTT topic prefix are correct. When you look at the JoinAccept vs. the JoinRequest, are they both on the expected EU868 frequencies?
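If you want to double-check on the gateway side, the topic prefix and QoS are set in the MQTT Forwarder configuration. A rough sketch of the relevant section (key names from memory, so please compare them against the example chirpstack-mqtt-forwarder.toml that ships with your forwarder version):
[mqtt]
  # Region/topic prefix, must match the region configured in ChirpStack (e.g. eu868)
  topic_prefix="eu868"
  # QoS used for event/command messages (0 by default)
  qos=0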

Yes, they are both on the same frequency.
Sometimes both are on 868500000, sometimes on 868100000.

Is it only the one device join-looping or are there multiple?

Honestly not sure what could be causing the problem at this point.

There are multiple devices with this phenomenon.

It's also weird that the problem vanishes after some time and the devices then work as if nothing happened (at least most devices). But this "fix" can happen after one hour or much later.

The problem also happens on some devices which have only been active for about 3 months, so it shouldn't have anything to do with the 180 days.

It just feels random right now.

Likely the devices are set to auto-rejoin on some schedule, or else whenever they fail to receive acknowledgment for their uplinks or "link testing" messages. That would be my guess as to what triggers the issue; the question is what the actual issue with the join process is. Again, you have no OTAA errors in the logs? I would have thought the OTAA errors you are seeing in the bar graphs correspond directly to error messages in the logs, but perhaps the UI just assumes an error if a device sends two joins back to back.

We know that these devices do an automatic rejoin every 24 hours.

Can the OTAA errors really be the problem if they don't happen on the same day as these join-process errors?
As we saw in the graph before, we had days with some join problems but no OTAA errors shown.

But I’ll have a look again.

Fair point. If you look at the device that is join-looping now, does the bar graph show OTAA errors? If not, then you are right and the OTAA errors are not the problem.

I am definitely getting to the end of my knowledge here. Could it be RX-window related? Could it be LoRaWAN MAC version related? Perhaps the devices are just faulty? All just shots in the dark from here. How far on average are these devices from the gateways? It could be that the antenna on the device isn't sensitive enough to receive the join-accepts, while the gateway's antenna can still receive the join-requests; I've seen join-looping from that myself. But then you'd expect high packet loss as well, which I don't believe you are seeing.
