Gateway downtime and activation lost

Hi all.

We have a bunch of sensors installed in a remote building. When we left the building a few months ago, the sensors were successfully transmitting data to the server. Then the gateway got unplugged and was plugged back in after a few weeks. Now we only receive data from a few of the sensors.

Looking at the live LoRaWAN frames in the gateway, it seems the sensors are still transmitting and reaching the gateway, but the messages are just ignored.

I deactivated the frame counter check. I don’t think it makes a difference, though: with a message every 20 minutes, we can’t have reached the maximum counter value. Also, we don’t get frame counter errors, and it is clear the devices have been deactivated, as the “activation” tab reads:
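A rough check of the "maximum value" reasoning, as a minimal sketch. The 20-minute interval comes from the post; the 16-bit and 32-bit counter widths are assumptions, since the post does not say which one the devices use.

```python
UPLINK_INTERVAL_MIN = 20  # from the post: one message every 20 minutes


def days_until_wrap(counter_bits: int, interval_min: int = UPLINK_INTERVAL_MIN) -> float:
    """Days of continuous uplinks before a frame counter of this width wraps."""
    uplinks_per_day = 24 * 60 / interval_min  # 72 uplinks per day at 20 minutes
    return 2 ** counter_bits / uplinks_per_day


if __name__ == "__main__":
    # Counter widths are assumptions, not taken from the post.
    for bits in (16, 32):
        print(f"{bits}-bit counter wraps after {days_until_wrap(bits):,.0f} days")
```

Even with a 16-bit counter, the wrap would take roughly 910 days, so a few months of operation is nowhere near it.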

This device has not (yet) been activated.

We’re still trying to figure out why we’re in this state and how to get out of it.

My understanding (from this thread) is that devices are responsible for rejoining when receiving too many errors / not enough acks. Perhaps the devices we’re using don’t do that, which would be a device issue, I suppose. The devices we use there are Ursalink UC11-T1 and EAGLE 1500(80).

Since we got data from those sensors in the past, they must have been working at some point in Chirpstack. Looks like the sensors still use the devAddr they were provided during the last join, but somehow Chirpstack lost trace of it. How would that be possible? Does Chirpstack forget active devices after a delay? Reading comments around this one, I understand Chirpstack server reboot shouldn’t affect OTAA. Besides, there are sensors we are still receiving.

Does anyone have a clue what we could do that wouldn’t involve restarting each sensor manually? (That would be an issue because remote building + sensors in private areas.)

Should this situation be blamed on the devices not joining when being ignored, or is there anything we could have done wrong?

Any hint welcome. Even assuming we manage to get physical access to all sensors to reboot them, the whole system seems very weak if it can’t handle a gateway power or connection loss, so there must be something we’re missing here.

Thanks.

Check device_session_ttl in the network server configuration. If the gateway was unplugged for multiple weeks, that was probably your issue. And yes, devices should rejoin after enough failed confirmed uplinks, otherwise you end up in the position of requiring a truck roll.

Thank you for your help.

The device_session_ttl parameter is the answer to my question.

We left the 744h default. That’s 31 days. The outage started on December 1st and I got a few sensors back on January 7th, after the TTL had expired. That would explain why we lost the sensors. How come some are still there? And how come they didn’t all come back at the same time? I don’t know. Looks like a few of them managed to reboot, somehow. I don’t think I can access logs showing whether they joined again or whether Chirpstack was nice enough to let them in despite the exceeded TTL.
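The date arithmetic can be checked with a short sketch. The year is hypothetical (the post only gives "December 1st" and "January 7th"), and it assumes the last uplink the server saw was on the day the outage started.

```python
from datetime import datetime, timedelta

DEVICE_SESSION_TTL = timedelta(hours=744)  # the 744h default quoted in the post

# Hypothetical year; the post only gives the day and month.
outage_start = datetime(2020, 12, 1)
first_sensors_back = datetime(2021, 1, 7)


def session_expiry(last_uplink: datetime, ttl: timedelta = DEVICE_SESSION_TTL) -> datetime:
    """Moment the device-session would expire if no further uplink arrives."""
    return last_uplink + ttl


expiry = session_expiry(outage_start)
print("TTL in days:", DEVICE_SESSION_TTL.days)
print("session expires on:", expiry.date())
print("expired before sensors came back:", first_sensors_back > expiry)
```

Under these assumptions the sessions would have expired about six days before the first sensors reappeared.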

At this point, I’m blaming it on the hardware. I shall contact the vendors to at least know how their sensors should behave.

Thanks again.


There are a couple of questions/confirmations I’d like to ask to be sure I understand correctly.

The “LIVE LORAWAN FRAMES” tab in the gateway page displays frames from devices that are still active. I can see those frames in the devices’ “LIVE LORAWAN FRAMES” tab as well. It also displays frames from inactive devices. This is where we noticed lost devices are still transmitting.

The gateway being in forwarder mode, it transmits everything it gets to the network server, even data from unknown devices.

Does the “LIVE LORAWAN FRAMES” tab display frames from unknown devices as well? I’d say yes because once a device has become inactive, it is the same as an unknown device. We just didn’t spot unknown devices in the logs, so I’m in doubt. Perhaps because there are no other LoRaWAN devices around.

The frames received from the inactive devices are displayed there as is, but since the device is inactive, they are not decoded, just dropped, and the device doesn’t get a confirmation for the uplink. Is this correct? (This seems relatively clear to me but I figured it was worth getting it confirmed because if I’m getting this wrong, then there’s something I really must be missing.)

Correct. You can also configure the gateway bridge to filter by NetID if you want to minimize uplink traffic, though in this case it would still show your expired devices because they have the same NetID.

Correct.

Live frames is basically a view of everything the gateway sees (again, barring filtering).

Crystal clear.

Thanks.

And thanks for the NetID tip.

Is there anything we can do to fix this remotely? Same issue here and I don’t want to fly across the country to our remote location and reset 160 sensors. PLEASE, I hope there is a fix. The activation has to be saved someplace we can restore from a distance.

If you don’t mind doing some scripting, then you can extract this info from the databases:

chirpstack_ns=> \d device_activation;
                                         Table "public.device_activation"
     Column      |           Type           | Collation | Nullable |                    Default                    
-----------------+--------------------------+-----------+----------+-----------------------------------------------
 id              | bigint                   |           | not null | nextval('device_activation_id_seq'::regclass)
 created_at      | timestamp with time zone |           | not null | 
 dev_eui         | bytea                    |           | not null | 
 join_eui        | bytea                    |           | not null | 
 dev_addr        | bytea                    |           | not null | 
 f_nwk_s_int_key | bytea                    |           | not null | 
 s_nwk_s_int_key | bytea                    |           | not null | 
 nwk_s_enc_key   | bytea                    |           | not null | 
 dev_nonce       | integer                  |           | not null | 
 join_req_type   | smallint                 |           | not null | 

chirpstack_as=> \d device
                                      Table "public.device"
               Column                |           Type           | Collation | Nullable | Default 
-------------------------------------+--------------------------+-----------+----------+---------
 dev_eui                             | bytea                    |           | not null | 
 created_at                          | timestamp with time zone |           | not null | 
 updated_at                          | timestamp with time zone |           | not null | 
 application_id                      | bigint                   |           | not null | 
 device_profile_id                   | uuid                     |           | not null | 
 name                                | character varying(100)   |           | not null | 
 description                         | text                     |           | not null | 
 last_seen_at                        | timestamp with time zone |           |          | 
 device_status_battery               | numeric(5,2)             |           |          | 
 device_status_margin                | integer                  |           |          | 
 latitude                            | double precision         |           |          | 
 longitude                           | double precision         |           |          | 
 altitude                            | double precision         |           |          | 
 device_status_external_power_source | boolean                  |           | not null | 
 dr                                  | smallint                 |           |          | 
 variables                           | hstore                   |           |          | 
 tags                                | hstore                   |           |          | 
 dev_addr                            | bytea                    |           | not null | 
 app_s_key                           | bytea                    |           | not null | 

I think eventually I want to implement a better solution for this in the upcoming ChirpStack v4 version (might not be directly included in v4.0.0, but it has been requested several times).
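As a rough idea of what such scripting could look like, here is a minimal sketch that keeps the newest device_activation row per device and attaches the app_s_key from the device table. Only columns shown in the table listings are used. The helper names and sample values are hypothetical, and fetching the rows (for example with a PostgreSQL driver returning dict-like rows) is left out. The columns shown do not include frame counters, so this only extracts addresses and keys.

```python
from datetime import datetime, timezone

# Query sketch for the chirpstack_ns database. DISTINCT ON is PostgreSQL-specific
# and keeps the first row per dev_eui, i.e. the newest one given the ORDER BY.
LATEST_ACTIVATION_SQL = """
SELECT DISTINCT ON (dev_eui)
       dev_eui, dev_addr, f_nwk_s_int_key, s_nwk_s_int_key, nwk_s_enc_key, created_at
FROM device_activation
ORDER BY dev_eui, created_at DESC
"""


def latest_activations(rows):
    """Same selection in plain Python: newest activation row per dev_eui."""
    latest = {}
    for row in rows:
        key = bytes(row["dev_eui"])  # bytea may arrive as bytes or memoryview
        if key not in latest or row["created_at"] > latest[key]["created_at"]:
            latest[key] = row
    return latest


def attach_app_s_key(activations, device_rows):
    """Merge app_s_key from chirpstack_as.device into each activation record."""
    app_keys = {bytes(d["dev_eui"]): bytes(d["app_s_key"]) for d in device_rows}
    return {
        eui: {**dict(act), "app_s_key": app_keys.get(eui)}
        for eui, act in activations.items()
    }


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    eui = bytes.fromhex("0102030405060708")
    ns_rows = [
        {"dev_eui": eui, "dev_addr": bytes.fromhex("00000001"),
         "created_at": datetime(2021, 1, 1, tzinfo=timezone.utc)},
        {"dev_eui": eui, "dev_addr": bytes.fromhex("00000002"),
         "created_at": datetime(2021, 6, 1, tzinfo=timezone.utc)},
    ]
    as_rows = [{"dev_eui": eui, "app_s_key": bytes(16)}]
    merged = attach_app_s_key(latest_activations(ns_rows), as_rows)
    print(merged[eui]["dev_addr"].hex(), merged[eui]["app_s_key"].hex())
```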

We are also currently investigating strange behavior. The device_session_ttl is still set to 31 days, and I checked e.g. the Redis key timeouts, which are currently almost at the same level.
But we sometimes found sensors whose activation info had disappeared (e.g. after a 3-day gateway outage). Are there any other cases where the device activation info is cleared before the device session timeout is reached?
And do I understand the DB tables info correctly, that we could periodically save the info from there as a backup, to be able to restore it in case of an error? E.g. I checked some Redis keys (which seem not to have been updated for several days, i.e. timeout < 30 days) and I cannot find the matching devices with the webapp search.
And did this implementation change with v4? Is it possible to implement a similar fallback solution there? (We will switch to v4 in the future.)

EDIT 2: We are using network-server v3.16.5
EDIT 3: I checked the device_activation table. The entries are there but the redis entry is missing. I searched for “lora:ns:devaddr:<dev_addr_as_hex>”
EDIT 4: The current status is strange: I have two non-working examples in different states:

  • both devices have up-to-date entries in the device_activation table (ns)
  • first device: no Redis entries for the lora:ns:device: and lora:ns:devaddr: prefixes
  • second device: valid Redis entry for lora:ns:devaddr: but none with the lora:ns:device: prefix
  • no corresponding device-session deleted log entry
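The two states in the list above can be told apart with a small check like the following sketch. The key prefixes are the ones quoted in this thread; using the hex-encoded DevEUI after lora:ns:device: and lowercase hex in both keys are assumptions. The key_exists callable is a hypothetical hook: with a Redis client it could wrap an EXISTS call, here it is backed by a plain set.

```python
NS_DEVICE_PREFIX = "lora:ns:device:"    # prefix quoted in the thread
NS_DEVADDR_PREFIX = "lora:ns:devaddr:"  # prefix quoted in the thread


def session_keys(dev_eui_hex: str, dev_addr_hex: str):
    """Build the two key names to look for (lowercase hex is an assumption)."""
    return (NS_DEVICE_PREFIX + dev_eui_hex.lower(),
            NS_DEVADDR_PREFIX + dev_addr_hex.lower())


def classify(dev_eui_hex: str, dev_addr_hex: str, key_exists) -> str:
    """Report which of the two session-related keys are present."""
    device_key, devaddr_key = session_keys(dev_eui_hex, dev_addr_hex)
    has_device = bool(key_exists(device_key))
    has_devaddr = bool(key_exists(devaddr_key))
    if has_device and has_devaddr:
        return "complete"
    if has_devaddr:
        return "devaddr only"
    if has_device:
        return "device only"
    return "missing"


if __name__ == "__main__":
    # Fake key store standing in for Redis; all values are made up.
    store = {"lora:ns:devaddr:00000002"}
    print(classify("0102030405060708", "00000001", store.__contains__))
    print(classify("0102030405060709", "00000002", store.__contains__))
```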

We probably found the cause: we hit the Redis maxmemory limit in the past and didn’t notice. We are not 100% sure, but this is probably the best explanation for our strange behavior (some keys are there, some are missing).

Is it possible to restore the required data stored in redis? Is it enough to create entries for the lora:ns:device: and lora:ns:devaddr: prefixes?


I can answer my own question: I don’t know why I asked :sweat_smile: because it is already known that the Redis device-session content cannot be restored or reconstructed. So in our case those devices must be rejoined manually.


Hello, has the problem been resolved in version v4?

It cannot be that the devices have to be reset locally, in a wireless technology.

In a worst-case scenario, devices should detect that they are no longer “connected” and rejoin the network (which would establish a new device-session in case it was lost).
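On the device side, that kind of detection could look like the following minimal sketch. It is not the firmware of any device mentioned in this thread; the class name and the threshold of 8 missed acknowledgements are hypothetical.

```python
class LinkMonitor:
    """Counts consecutive unacknowledged confirmed uplinks and signals a rejoin."""

    def __init__(self, max_missed_acks: int = 8):  # hypothetical threshold
        self.max_missed_acks = max_missed_acks
        self.missed = 0

    def on_confirmed_uplink(self, ack_received: bool) -> bool:
        """Return True when the device should start a new OTAA join."""
        if ack_received:
            self.missed = 0  # link is alive, reset the counter
            return False
        self.missed += 1
        if self.missed >= self.max_missed_acks:
            self.missed = 0  # start counting again after the rejoin attempt
            return True
        return False


if __name__ == "__main__":
    monitor = LinkMonitor(max_missed_acks=3)
    results = [monitor.on_confirmed_uplink(ack) for ack in (True, False, False, False)]
    print(results)
```

A device that only sends unconfirmed uplinks has no acknowledgements to count, so it would need another signal for the same purpose.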


We lost connectivity to our MQTT server for several days and now that everything is back up, the gateways are chiming in, but every one of our 2500 sensors across the US is now saying “This device has not (yet) been activated.”

Assuming you’re on a pre-4.8 version, what was your device session TTL set to? The default should be way more than “several days”.

After the last incident in May 2022, we upped the TTL to 1 year.

We have had outages in the past due to power at the local gateway sites… but this time our MQTT server went a little crazy and caused this whole debacle. The gateways are reporting in now, but all sensors lost activation. Running on v3.