Catching gaps in Uplink

ektanoor · December 25, 2017, 12:24pm

We have a very peculiar problem here. We have some hundreds of devices. Some two hundred of them are just crazy: they seem to go to sleep for days, they reset quite frequently, and there are even suspicions that they choose transmission parameters that don’t fit the protocol. All in all it is a mess that one would like to replace right away, but we can’t do it, at least for the moment. One thing that helps us is catching gaps in fCntUp.

So, I tried to create an fCntGap variable inside loraserver and lora-app-server and tried to show it on the “Device Activation Page”. It works well and is stable for “sleeping” or transmission gaps, yet something goes wrong when we try to catch the resets. On loraserver it works fine, down to the logs; however, when it gets to lora-app-server, the page does not refresh. I have a rough idea why, but I could not find all the reasons this occurs.

So my question is: is there a way to publish fCntGap on the “Device Activation Page” every time a gap occurs? Or is it better to create a separate tab and work separately from everything else? In any case I would like to see suggestions on how to do it in lora-app-server, as I am not an expert in Go.

In any case the changes on code are here:

brocaar · December 27, 2017, 11:30am

Wouldn’t it be better to add a kind of notification / integration, based on the fCnt field in the MQTT / HTTP messages? In this case you don’t have to make any LoRa App Server modifications and backport your changes each time I issue a new release.

ektanoor · December 27, 2017, 6:36pm

Sadly no. Such an approach is good for nearly individual cases. We had such cases in the past, btw. In our present case we need to retain even more than we are doing now: a history of the resets and gaps that may have occurred in a week or a month. As I noted, we have a problem with a couple of hundred devices in the middle of some hundreds more. Backports ain’t an issue at such volumes. Unfortunately I can’t figure out where things get blocked.

brocaar · December 28, 2017, 9:02am

I haven’t completely reviewed your code, but doesn’t it only show the gap between two frames?

Time | FCnt
0    | 10
1    | 11
2    | 15
3    | 16

So only at time 2 you’ll see a packet-loss of 3 frames (FCnt 12, 13 and 14 are missing); at time 3 this is set back to 0, as there is no packet-loss between time 2 and 3.
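
To make the arithmetic in the table explicit, here is a minimal sketch (not code from loraserver or from the linked changes). The function name `fcnt_gaps` is hypothetical, and it assumes the counter only increases; resets are not handled here.

```python
def fcnt_gaps(fcnts):
    """Return (fcnt, lost_frames) for each frame in a sequence of uplink counters.

    lost_frames is the number of counter values skipped since the previous
    frame. Assumption: counters only increase (no resets, no 16-bit rollover).
    """
    results = []
    prev = None
    for fcnt in fcnts:
        # The first frame has nothing to compare against, so report 0 lost.
        lost = 0 if prev is None else max(fcnt - prev - 1, 0)
        results.append((fcnt, lost))
        prev = fcnt
    return results


# The FCnt column from the table above.
print(fcnt_gaps([10, 11, 15, 16]))
# [(10, 0), (11, 0), (15, 3), (16, 0)]
```

The gap is non-zero only on the frame that follows the loss, so anything that needs a history has to store each non-zero value rather than only the latest one.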

As the FCnt is already exposed in the .../rx MQTT topic (and HTTP integration), I think it would be better to implement it at your application side. E.g. you could write this data into a time-series database like InfluxDB (or even a regular database).
That would allow you to perform any kind of calculations on it. E.g. you could combine this data with other metadata like data-rate and see if there are correlations.

ektanoor · December 29, 2017, 10:42am

Yes, if you have one application. But we now have 9 apps, two brocaar servers, two non-brocaar servers, some 12 gateways and hundreds of devices spread in various forms and volumes across some 9 places (not counting pure LoRa stuff on some sort of Dragino or similar, plus a TTN link). No, it is not easy to follow such a problem on the app side, especially if it is not fully your app. Sometimes I am forced to move end-devices to a non-brocaar server (specifically Petr Gotthard’s) to follow their behavior, as there we directly get some statistics that brocaar servers still lack.

ektanoor · December 29, 2017, 10:45am

And yes, we call them “brocaar servers”.

brocaar · December 29, 2017, 1:24pm

Why is my proposal limited to one application? You can implement a “fCnt gap” application subscribing to application/+/node/+/rx and it will see data of all your nodes and applications.
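
A rough sketch of what such a “fCnt gap” application could do with each message, using only the Python standard library. It rests on several assumptions: that the two wildcard levels of the topic carry the application ID and the device EUI, that the payload is JSON with an `fCnt` field, and that a counter which does not increase means a reset (it could also be a retransmission). `GapTracker` and `handle_rx` are hypothetical names, not part of LoRa App Server.

```python
import json


class GapTracker:
    """Keeps the last fCnt per device and a history of gap / reset events."""

    def __init__(self):
        self.last_fcnt = {}  # (app_id, dev_eui) -> last seen fCnt
        self.events = []     # (app_id, dev_eui, kind, previous fCnt, new fCnt)

    def handle_rx(self, topic, payload):
        """Process one message received on application/+/node/+/rx.

        Returns the recorded event tuple, or None if nothing unusual happened.
        """
        parts = topic.split("/")
        if (len(parts) != 5 or parts[0] != "application"
                or parts[2] != "node" or parts[4] != "rx"):
            return None  # not an rx topic, ignore it
        key = (parts[1], parts[3])            # assumed: application ID, devEUI
        fcnt = json.loads(payload)["fCnt"]    # assumed: JSON payload with fCnt
        prev = self.last_fcnt.get(key)
        self.last_fcnt[key] = fcnt
        if prev is None:
            return None  # first frame seen for this device
        if fcnt <= prev:
            kind = "reset"  # assumption: a non-increasing counter is a reset
        elif fcnt > prev + 1:
            kind = "gap"
        else:
            return None  # consecutive frame, nothing to record
        event = (key[0], key[1], kind, prev, fcnt)
        self.events.append(event)
        return event


tracker = GapTracker()
topic = "application/1/node/0102030405060708/rx"  # made-up IDs
for fcnt in (10, 11, 15, 2):
    tracker.handle_rx(topic, json.dumps({"fCnt": fcnt}))
print(tracker.events)
```

With an MQTT client library, the message callback would pass the topic and payload of each message to `handle_rx`. `tracker.events` could then be written to whatever database is used, which gives the week or month of history mentioned earlier in the thread.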

ektanoor · December 29, 2017, 10:04pm

Again, that works, somehow, if you have one single environment. No, we don’t have one single environment, and in the near future it will be even more isolated. Besides, this pushes the problem into the “dominion” of “other departments”, to those who mainly work on database processing and aggregation. These people worry more about how well the counts fit the bills and less about checking whether a modem is delivering wrong readings. They will not be happy to hear us asking “do we have a gap or not” in some electrical or gas counter. Besides, these people are averse to electronics and whatever happens around it. A gap for them is “doesn’t work”. It is “No it does not work and I cannot trust the data and it goes again and again and there is nothing to be done until they fix it!”. That’s what happens when you push such a technical problem up to that level. No, this is not a solution. I need raw data at gateway<->server level (even gateway only, like in Multitech’s stations), I need some statistics there, I have to know where things start to go wrong and not guess where, in miles of connections, things start to fall apart.

Besides, MQTT itself is not a good guarantor of safe delivery, and I have had some serious incidents with that.

OK, I get that we have something very specific to our work here. I am not asking you to change the code to suit our needs. I just wanted to know where things get blocked, and as I have near-zero practice in Go, I thought you could tell me where I should look. Sorry for bothering you with this, but we’ll try our way.