Internal timeouts not reconnecting on overload/temporary disk problems

andershedberg · January 20, 2020, 8:15am

Hi, just logging a problem we saw this weekend. If someone else experiences it, it might be worth looking deeper into.

This weekend I got a call that no LoRa devices were working. Logging in to the Loraserver, I noticed that packets were coming in from the gateways (showing as bars on the gateway tabs) and were registered in the Loraserver on the gateway pages, but they were not registered on the individual devices and were not forwarded to the external applications.

Due to the remote setup I could not log in to the machine and examine the logs; I could only check the web pages it serves and the last hour as seen by VMware (it’s a virtual machine running on VMware ESXi). It seems that the server might have experienced slow disk I/O because of a backup job running at the time it stopped working correctly, and maybe some RAID problems (no warnings of disk problems, though). According to the VMware info, there was also a failure of some kind where the backup snapshot was not removed correctly. The server never went offline and all the web stuff kept running (we monitor the server, so we know it never went down or rebooted), but I suspect the disk problems might have caused an internal overload/timeout of some connections inside the Loraserver.
These connections (this is my theory) were not restored automatically once the disk problems resolved themselves, and thus no traffic was forwarded.
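
To make the theory concrete: a service that opens its internal connections once at startup and never reopens them will stay broken after a transient timeout, whereas one that drops the failed handle and reconnects will recover by itself. The following is a minimal Python sketch of the second behaviour, using exponential backoff. It is not taken from the Loraserver code; `connect` and `operation` are hypothetical callables standing in for whatever database or broker client is involved.

```python
import time


def call_with_reconnect(connect, operation, max_attempts=5, base_delay=0.5,
                        sleep=time.sleep):
    """Run operation(conn), reconnecting with exponential backoff on failure.

    connect and operation are hypothetical stand-ins for a real client;
    they are assumptions for illustration, not Loraserver APIs.
    """
    conn = None
    for attempt in range(max_attempts):
        try:
            if conn is None:
                conn = connect()  # (re)open the connection lazily
            return operation(conn)
        except (ConnectionError, TimeoutError):
            conn = None  # drop the broken handle so the next attempt reconnects
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # wait 0.5s, 1s, 2s, ...
```

Without the `conn = None` reset, every later call would keep reusing the dead connection, which would match the symptom described above: the process stays up but no traffic gets through until a restart.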

Rebooting the machine solved the problem, and everything came back up working.

Server used: a very old one; I can’t find any version info. Most likely release 1.0-something. About 50 gateways and 4500 sensors (not all active, but all registered on the server). It has been up and running without any problems for about 2-3 years.

promero · January 21, 2020, 6:46pm

Hi @andershedberg

I just had a similar problem. I realized today that I had not been receiving data from any of my gateways for three days (since last Saturday). Fortunately, some of them run the ChirpStack Packet Multiplexer, which sends data both to TTN and to my ChirpStack servers. TTN was receiving data fine, but ChirpStack was not.

I did not find anything in the logs:

journalctl -u chirpstack-application-server -f -n 50
journalctl -u chirpstack-network-server -f -n 50

The solution: restart the services.

Did you or anyone else find another solution? How can we prevent this situation?

Kind regards,
Pablo Romero.
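
One possible stopgap, until the root cause is known, is a small watchdog that restarts the services when no uplink has been seen for too long. The Python sketch below is only an illustration under stated assumptions: the unit names are the ones from the `journalctl` commands above, the 15-minute threshold is arbitrary, and how the timestamp of the last uplink is obtained (for example from the application integration) is left open and hypothetical.

```python
import subprocess

# Unit names taken from the journalctl commands earlier in this thread.
SERVICES = ["chirpstack-network-server", "chirpstack-application-server"]


def is_stale(last_uplink_ts, now, max_silence_s=900):
    """Return True if no uplink arrived within max_silence_s seconds.

    last_uplink_ts and now are Unix timestamps; the 900 s default is an
    arbitrary assumption and should reflect the real uplink interval.
    """
    return (now - last_uplink_ts) > max_silence_s


def restart_services(services=SERVICES, run=subprocess.run):
    """Restart each systemd unit; requires root privileges on the host."""
    for name in services:
        run(["systemctl", "restart", name], check=True)
```

A cron job or systemd timer could call `is_stale(...)` and, when it returns true, `restart_services()`. This only masks the problem; it does not explain why the connections are not re-established.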