Redis: connection pool timeout (code: 2)

Hi friends,
I have launched multiple network servers and multiple application servers on a single machine. After some weeks, I see the error below when I try to view gateway data:

I am having the same problem. Any luck figuring out a solution?

A timeout occurs when the client cannot get a connection to the (Redis) server within a reasonable amount of time. Either Redis is unable to respond to the request, or there is a network issue causing the connection to fail.
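For what it’s worth, that exact message (“redis: connection pool timeout”) is what the go-redis client returns when every connection in its client-side pool is already checked out and none becomes free within the pool timeout, so it does not necessarily indicate a network problem; as far as I can tell the ChirpStack servers use go-redis under the hood. Here is a minimal sketch that reproduces the message; the pool size, key name and timeouts are made up for illustration:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()

    // Deliberately tiny pool so exhaustion is easy to trigger.
    rdb := redis.NewClient(&redis.Options{
        Addr:        "localhost:6379",
        PoolSize:    2,               // comparable to pool_size in the ChirpStack config
        PoolTimeout: 2 * time.Second, // how long a caller waits for a free connection
    })

    // Tie up both pooled connections with blocking reads that never return.
    for i := 0; i < 2; i++ {
        go func() {
            rdb.BLPop(ctx, 0, "demo:empty:list") // timeout 0 = block forever
        }()
    }
    time.Sleep(500 * time.Millisecond)

    // With every connection busy, the next command waits PoolTimeout and then fails.
    if err := rdb.Ping(ctx).Err(); err != nil {
        fmt.Println(err) // prints: redis: connection pool timeout
    }
}

In other words, long-lived blocking commands can exhaust the pool on the client side while Redis itself is perfectly healthy.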

Have you ever found a solution?

Hello. I see the same problem in the Application server as well as in the Network server, and they seem to be influencing each other:
redis-cli info | grep -i "client" gives me

# Clients
connected_clients:5
maxclients:10000
client_recent_max_input_buffer:20480
client_recent_max_output_buffer:0
blocked_clients:1
tracking_clients:0
clients_in_timeout_table:0
mem_clients_slaves:0
mem_clients_normal:68764
evicted_clients:0

when everything is working fine and idle. But when I open many Application Server tabs, e.g. /frames or /data, the numbers of connected_clients and blocked_clients rise (by pretty much one per tab) until they hit a threshold at

connected_clients:40
blocked_clients:38

At exactly the same time, the Application server starts showing “redis: connection pool timeout (code: 2)” as a black notification in the lower-left corner of the browser window, and the Network server starts logging a lot of

Okt 21 08:55:59 lora chirpstack-network-server[29219]: time="2022-10-21T08:55:59.585312729+02:00" level=error msg="gateway/mqtt: acquire lock error" error="acquire lock error: redis: connection pool timeout" key="lora:ns:stats:lock:b827ebfffedbad45:23427c74-8aa9-4047-b5bb-8bbba07ba595" stats_id=23427c74-8aa9-4047-b5bb-8bbba07ba595

in sudo journalctl -f -x -u chirpstack-network-server.service | grep -i "error".
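For reference: blocked_clients in INFO counts clients that are sitting in a blocking call (BLPOP/BRPOP/XREAD and friends), and redis-cli client list shows per connection which command each client is currently parked in (cmd=) and for how long (age=/idle=). Below is a small sketch, again assuming go-redis, that groups that output by command to see what the ~40 connections are actually doing; it is purely illustrative, the address and parsing are nothing ChirpStack-specific:

package main

import (
    "context"
    "fmt"
    "strings"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    // CLIENT LIST returns one line per connection with fields such as
    // addr=, name=, age=, idle= and cmd= (the command the client is in).
    list, err := rdb.ClientList(ctx).Result()
    if err != nil {
        panic(err)
    }

    // Count connections per command.
    counts := map[string]int{}
    for _, line := range strings.Split(strings.TrimSpace(list), "\n") {
        for _, field := range strings.Fields(line) {
            if strings.HasPrefix(field, "cmd=") {
                counts[strings.TrimPrefix(field, "cmd=")]++
            }
        }
    }
    for cmd, n := range counts {
        fmt.Printf("%-20s %d\n", cmd, n)
    }
}

If most of the blocked connections turn out to be stuck in a blocking read opened per browser tab, that would match the roughly one-client-per-tab pattern described above.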

When I open even more browser tabs, the errors become more frequent and Redis seems to hit a ceiling at

connected_clients:41
blocked_clients:40

At the same time, the “Last seen” on /#/organizations/x/gateways becomes “a minute ago” instead of “a few seconds ago”.

When I close the browser window containing all those ChirpStack tabs, the client numbers slowly decrease a bit but stay high (e.g. connected_clients:37, blocked_clients:33 after ~30 minutes), and the errors in the network server logs stop. If I open just a few more new tabs, the errors start again as soon as blocked_clients reaches 40.
I have had situations in the past where the errors started at night, when a lot of devices wake up and transmit, and did not stop until I issued a sudo systemctl restart for redis, chirpstack-application-server or chirpstack-network-server (can’t be sure which of those did the trick).

I did some more experiments including service restarts:

  • restart chirpstack-application-server: connected_clients:4, blocked_clients:0 (with the browser window with all the tabs still open, the numbers rise again step by step)
  • restart chirpstack-network-server: connected_clients:20, blocked_clients:16 (with the browser window still open, then later rising back to connected_clients:39, blocked_clients:36)
  • restart redis: connected_clients:3, blocked_clients:0

The VM on which both the application server and the network server are running looks as follows:

lscpu | grep "CPU(s)"
CPU(s):              2
  • chirpstack-application-server 3.17.8
  • chirpstack-network-server 3.16.5
  • Redis 6:7.0.5-1rl1~bionic1
  • Ubuntu 18.04.6 LTS

I see the following in the documentation of both the Network server and the Application server:

# Connection pool size.
#
# Default (when set to 0) is 10 connections per every CPU.
pool_size=0

and tried to play with it a bit, but neither small values like 2 (which made the problem much worse) nor high values like 100 for pool_size remedied the problem.
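A hedged bit of arithmetic on that default: with pool_size = 0 meaning 10 connections per CPU, this 2-CPU VM would get 2 × 10 = 20 connections per service, and with the Application server and the Network server each holding their own pool that makes 2 × 20 = 40, which would at least be consistent with the ceiling of ~40/41 clients above and with the drop by roughly 20 when one of the two services is restarted. This is only my own reading of the documented default, not something verified in the code.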

@brocaar I see two issues here:

  1. In my opinion, a connection pool timeout in the application server shouldn’t influence the network server. Essentially this would mean that any user opening “too many” browser tabs could block the whole LoRaWAN network. Also, it’s unclear to me why restarting chirpstack-network-server halves (40->20) the client connections. I would have thought the Network server only keeps 1-2 connections open while the Application server hogs the rest. Are they maybe somehow “sharing” a pool, e.g. by using the same ID or so?
  2. It looks like the servers, or at least the Application server, are not reusing client connections much (I think I read the term “client connection leak” somewhere, much like “memory leak”). Maybe some tweaking of the timeout settings can be done (see the sketch below)? I googled a bit and found some resources:
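To make issue 2 a bit more concrete, here is a minimal sketch of the pool options the go-redis v8 client exposes for connection reuse and idle-connection cleanup. This only illustrates the client library, not how ChirpStack actually wires it up, and the address and values are made up:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/go-redis/redis/v8"
)

func main() {
    rdb := redis.NewClient(&redis.Options{
        Addr:         "localhost:6379",
        PoolSize:     20,              // hard cap on connections per process (cf. pool_size)
        MinIdleConns: 2,               // keep a few warm connections ready for reuse
        PoolTimeout:  5 * time.Second, // how long a caller waits for a free connection
        IdleTimeout:  5 * time.Minute, // close connections that sit idle longer than this
        MaxConnAge:   0,               // 0 = connections are never recycled by age
    })

    if err := rdb.Ping(context.Background()).Err(); err != nil {
        fmt.Println("ping failed:", err)
    }

    // PoolStats shows how well connections are being reused:
    // Hits = reused from the pool, Misses = newly dialed, Timeouts = pool timeouts.
    stats := rdb.PoolStats()
    fmt.Printf("hits=%d misses=%d timeouts=%d total=%d idle=%d\n",
        stats.Hits, stats.Misses, stats.Timeouts, stats.TotalConns, stats.IdleConns)
}

As far as I understand, a connection that is parked in a blocking command is checked out of the pool and does not count as idle, so IdleTimeout alone would not reclaim the blocked ones; PoolStats().Timeouts is a quick way to see how often callers ran into the pool timeout.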

Please tell me where I might be misled, what I have done wrong, or how I can provide more information or otherwise help in debugging and fixing this.

Cheers
