Processing gateway error time out for waiting for connection

running chirpstack after a whole night, i found all the uplink data was lost from yesterday night (the webUI show the gateway last seem at yesterday night); i check my network and it is available.
my devices are trigger with schedule uplink, it seems like just suddenly disconnected.

i restart gateway but cannot connect to chirpstack, i try restart chirpstack docker and finally it connected.
could anyone tell me what happened?

here are the docker logs from chirpstack docker(at midnight)

edit:
i got the same error again, here are the error logs start

@brocaar could you help or guide me to trace the error?

This looks like a Redis connection issue. Have you checked your Redis logs?

redis logs looks nothing abnormal.

here are the redis logs when the error occured

docker logs --since ‘2022-09-09T12:25:05.999999999’ --until=‘2022-09-09T12:29:05.999999999’ 79
1:M 09 Sep 2022 04:25:49.381 * 10000 changes in 60 seconds. Saving…
1:M 09 Sep 2022 04:25:49.382 * Background saving started by pid 86
86:C 09 Sep 2022 04:25:49.386 * DB saved on disk
86:C 09 Sep 2022 04:25:49.386 * RDB: 0 MB of memory used by copy-on-write
1:M 09 Sep 2022 04:25:49.482 * Background saving terminated with success

I seem to have encountered the same issue. Running chirpstack v4 using a slightly modified docker compose (but no changes to the redis section), overnight it looks like chirpstack lost connection to… something - perhaps redis? Everything was proceeding as normal, and then all of a sudden we start to see timeout errors. No other logs show anything strange at that time.

I think redis logs around that time look normal

chirpstack-docker-redis-1  | 1:M 21 Sep 2022 01:43:18.077 * Background saving terminated with success
chirpstack-docker-redis-1  | 1:M 21 Sep 2022 01:59:15.864 * 100 changes in 300 seconds. Saving...
chirpstack-docker-redis-1  | 1:M 21 Sep 2022 01:59:15.864 * Background saving started by pid 61
chirpstack-docker-redis-1  | 61:C 21 Sep 2022 01:59:15.872 * DB saved on disk
chirpstack-docker-redis-1  | 61:C 21 Sep 2022 01:59:15.872 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
chirpstack-docker-redis-1  | 1:M 21 Sep 2022 01:59:15.965 * Background saving terminated with success
chirpstack-docker-redis-1  | 1:M 21 Sep 2022 02:59:16.075 * 1 changes in 3600 seconds. Saving...
chirpstack-docker-redis-1  | 1:M 21 Sep 2022 02:59:16.075 * Background saving started by pid 62
chirpstack-docker-redis-1  | 62:C 21 Sep 2022 02:59:16.082 * DB saved on disk
chirpstack-docker-redis-1  | 62:C 21 Sep 2022 02:59:16.082 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
chirpstack-docker-redis-1  | 1:M 21 Sep 2022 02:59:16.177 * Background saving terminated with success

UPDATE: it appears that restarting redis (docker compose restart redis) allowed chirpstack to reconnect, but it’s still somewhat concerning.

I was able to connect to redis while this issue was happening, and it didn’t look like there was anything going on that would have caused the timeout. I see that there are 100 open redis connections that are made between chirpstack and redis, but this doesn’t seem to be bumping up against any limits as far as I can tell.

i try to restart chirpstack itself (docker compose restart chirpstack) will reconnect success, but i keep facing the same problem from chirpstack v4; no matter in rc version or v4 release version.

i try to use redis locally(not in chirpstack docker, on my local machine), the problem still occuried.
i don’t have any idea to deal with.

my test env:
I have two gateways, two devices, connect to a chirpstack instance, but the two devices send at high frequency (about 5s once), after a night I can see about 228 devaddrs in redis.

@brocaar Maybe you have some more insights into how redis is being used. In debugging I checked my client list when everything was blocked and saw every connection looked like this (all the connections have xread as their last command):

id=171 addr=172.18.0.8:38826 laddr=172.18.0.3:6379 fd=76 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=173 addr=172.18.0.8:38830 laddr=172.18.0.3:6379 fd=78 name= age=22776 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=175 addr=172.18.0.8:38834 laddr=172.18.0.3:6379 fd=80 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=176 addr=172.18.0.8:38836 laddr=172.18.0.3:6379 fd=81 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=178 addr=172.18.0.8:38840 laddr=172.18.0.3:6379 fd=83 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=179 addr=172.18.0.8:38842 laddr=172.18.0.3:6379 fd=84 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=180 addr=172.18.0.8:38844 laddr=172.18.0.3:6379 fd=85 name= age=22776 idle=1 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=181 addr=172.18.0.8:38846 laddr=172.18.0.3:6379 fd=86 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=182 addr=172.18.0.8:38848 laddr=172.18.0.3:6379 fd=87 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=239 addr=172.18.0.8:39106 laddr=172.18.0.3:6379 fd=107 name= age=21576 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=240 addr=172.18.0.8:39108 laddr=172.18.0.3:6379 fd=103 name= age=21576 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=161 addr=172.18.0.8:38806 laddr=172.18.0.3:6379 fd=66 name= age=22776 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=83 addr=172.18.0.8:38406 laddr=172.18.0.3:6379 fd=88 name= age=23392 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1
id=84 addr=172.18.0.8:38408 laddr=172.18.0.3:6379 fd=89 name= age=23392 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=61488 events=r cmd=xread user=default redir=-1

I then issued a CLIENT KILL TYPE normal to kill all the connection and chirpstack did reconnect and all my new connections look like the following, and the interface is once again working:

id=391 addr=172.18.0.8:42176 laddr=172.18.0.3:6379 fd=38 name= age=165 idle=165 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=4 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20488 events=r cmd=NULL user=default redir=-1
id=392 addr=172.18.0.8:42178 laddr=172.18.0.3:6379 fd=39 name= age=165 idle=165 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=4 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20488 events=r cmd=NULL user=default redir=-1
id=393 addr=172.18.0.8:42180 laddr=172.18.0.3:6379 fd=40 name= age=165 idle=165 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=4 argv-mem=0 obl=0 oll=0 omem=0 tot-mem=20488 events=r cmd=NULL user=default redir=-1

@zeara One question for you: when you’ve found this issue, have you had your browser open on one of the either gateway or device “frames” pages? I just realized that I’ve been sitting on those pages polling - I wonder if over time something is happening internally and those connections aren’t getting killed properly and just end up blocking.

yes, i open the loraframe’s or event’s page even for a whole night; sometimes i forget to close sometimes i uss it to monit data uplink

@zeara I’m going to try leaving mine closed and see whether I end up with the full client list again and the timeouts.

@brocaar This continues to happen to us. Our current configuration has nginx as a reverse proxy in front of Chirpstack using grpc_pass for the API calls. When watching the frames page for one of our gateways, I would hit a minute with no traffic coming through the gateway, or no payloads from a device (if we were on that page), and nginx would close the connection due to a timeout.

022/09/22 21:18:53 [error] 19#19: *623 upstream timed out (110: Operation timed out) while reading upstream, client: xx.xx.xx.xx, server: chirpstack.example.com, request: "POST /api.InternalService/StreamDeviceFrames HTTP/2.0", upstream: "grpc://172.18.0.8:8080", host: "chirpstack.example.com", referrer: "https://chirpstack.example.com/"

My suspicion is in these cases, nginx never receives a response from the grpc call (because it’s waiting on traffic from the gateway, which isn’t returning in the 60s default) and then is closing the connection to the client saying that it’s closed. But perhaps Chirpstack is not closing the connection once nginx does, and so it’s just keeping that connection to redis open, waiting for frames to come in to send upstream (which is now gone).

Is there a way to ask a running Chirpstack instance how many grpc calls it’s currently serving? Or alternately, figuring out a way for Chirpstack to let go of these redis connections once the proxying connection goes away?

Thanks for all the feedback. Based on the above info, I will try to reproduce this issue at my side again. Given that all the connections are using the xread as their command looks like this is related to the streaming logs and events.

I have found the issue and will issue a bugfix release soon. Thanks for all the input :+1: