Failed join when end node too close to gateway

Hi,

This has driven me absolutely nuts but now I know what’s going on. I’m working with the end node mere meters away from our test gateway here at the office.

  • End node transmits a join request on 868.500 MHz.
  • Gateway receives the join request on 868.500 MHz (RSSI -63) and on the adjacent channel 868.300 MHz (RSSI -97).
  • Gateway forwards both join requests to the network server.
  • Network server responds to the join request on 868.500 MHz with a join accept on 868.500 MHz.
  • Network server responds to the join request on 868.300 MHz with a join accept on 868.300 MHz, followed by an RX2 join accept on 869.525 MHz. This is the join that remains valid to the network server.
  • End node receives the join accept on 868.500 MHz and now thinks it has successfully joined.

Surely you see the problem. To the network server, the assigned network address and session key are the ones it sent last, on 868.300 MHz, whereas to the end node they are the ones it received on 868.500 MHz.

As a result, both sides happily think the join was successful, but communication is impossible because the network address and session keys are different.
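
For anyone wondering why the two accepts end up with different keys: in LoRaWAN 1.0.x the session keys are derived from the AppNonce carried in the join accept, and the two accepts were generated independently. A minimal sketch of that derivation (made-up example values, endianness details glossed over), just to show that different AppNonce values give different NwkSKeys:

```go
package main

import (
	"crypto/aes"
	"encoding/hex"
	"fmt"
)

// LoRaWAN 1.0.x derivation:
// NwkSKey = aes128_encrypt(AppKey, 0x01 | AppNonce | NetID | DevNonce | pad16)
// (AppSKey is the same construction with a 0x02 prefix.)
func deriveNwkSKey(appKey [16]byte, appNonce, netID [3]byte, devNonce [2]byte) []byte {
	block, _ := aes.NewCipher(appKey[:]) // 16-byte key, cannot fail
	in := make([]byte, 16)               // trailing bytes stay zero (padding)
	in[0] = 0x01
	copy(in[1:4], appNonce[:])
	copy(in[4:7], netID[:])
	copy(in[7:9], devNonce[:])
	out := make([]byte, 16)
	block.Encrypt(out, in)
	return out
}

func main() {
	appKey := [16]byte{0x2b, 0x7e, 0x15, 0x16} // made-up root key
	netID := [3]byte{0x00, 0x00, 0x13}
	devNonce := [2]byte{0x12, 0x34} // one transmission, so one DevNonce

	// The two join accepts were generated independently, so they carry
	// different AppNonce values -> different session keys on each side.
	fmt.Println(hex.EncodeToString(deriveNwkSKey(appKey, [3]byte{0x00, 0x00, 0x01}, netID, devNonce)))
	fmt.Println(hex.EncodeToString(deriveNwkSKey(appKey, [3]byte{0x00, 0x00, 0x02}, netID, devNonce)))
}
```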

Of course, if I remove the antenna from the end node, every join succeeds 100% of the time, but this is still a very annoying problem.

Would there be a way for the network server to be a bit smarter about scenarios like this one?

I’m thinking, for example, of not sending the join accept to the gateway immediately, but waiting a small interval (100 ms?) for the same join request to arrive on other channels, then selecting the one with the highest RSSI and ignoring the rest.
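
Something along these lines is what I have in mind. This is only a rough sketch with invented names (uplinkReport, collectAndPick), not based on the actual ChirpStack code:

```go
package sketch

import (
	"encoding/hex"
	"sync"
	"time"
)

type uplinkReport struct {
	phyPayload []byte // raw LoRaWAN frame as reported by the gateway
	frequency  uint32 // Hz
	rssi       int
}

// collectAndPick buffers every report of the same raw frame for a short
// window, then hands only the strongest copy to the handler, so the weaker
// "ghost" copies seen on adjacent channels get dropped.
func collectAndPick(reports <-chan uplinkReport, window time.Duration, handle func(uplinkReport)) {
	var mu sync.Mutex
	pending := map[string][]uplinkReport{}

	for r := range reports {
		key := hex.EncodeToString(r.phyPayload) // frequency deliberately NOT part of the key
		mu.Lock()
		first := len(pending[key]) == 0
		pending[key] = append(pending[key], r)
		mu.Unlock()

		if !first {
			continue
		}
		k := key
		time.AfterFunc(window, func() { // window elapsed: pick the best copy
			mu.Lock()
			copies := pending[k]
			delete(pending, k)
			mu.Unlock()

			best := copies[0]
			for _, c := range copies[1:] {
				if c.rssi > best.rssi {
					best = c
				}
			}
			handle(best)
		})
	}
}
```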

I would have thought that join requests go through the same de-duplication logic as normal uplinks, but perhaps not?

I can understand the frustration, and I agree it’s odd that the de-duplication looks at more than just the packet payload, rather than picking the report with the strongest signal, which should hopefully be the true frequency. That said, there are various reports that this kind of receiver overload sometimes garbles even the on-frequency reception enough that it fails CRC.

You probably don’t want to remove antennas, as that creates a high-SWR situation that’s hard on the transmitter electronics. What you can do instead is replace the antenna with a 50 ohm non-inductive resistor; a little surface-mount one probably works on the node, and you can use a connectorized load on a gateway.

You may also be able to turn down the node’s power level - when doing initial bring-up work I set things to 0 dBm, as I wasn’t sure I was operating the TX/RX switch correctly yet.

Something I’ve done a lot in testing is to put some nodes as far away as I could, each hanging off a Raspberry Pi with a remote-access solution, so I could get into the Pi and use it to flash or reset the node, view its serial output, and so on. That was particularly useful because it let me see how things behaved at a distance where a higher SF would be usable and success rates lower. Moving the gateway to an upper floor or roof machine space might be an option, too.

And then there’s the cookie/biscuit tin idea; combined with distance, that should knock the signal level down quite a bit. Some tins have more attenuation than others.

Indeed it seems to be de-duplicated by the simple means of the last join request replacing the previous one. Unfortunately the gateway sends the true join request, on 868.500 MHz, first.

For this corner case to be fixed, the de-duplication needs to be a bit smarter and only replace an ongoing join request with a new one if less than X time has passed (1 s would work) and the new request has a higher RSSI than the previous one.
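
Expressed as code, the rule would be roughly this (the type and function names are made up for illustration, not the real data structures):

```go
package sketch

import "time"

// pendingJoin stands in for whatever state the server keeps for an in-flight
// join request; the names here are invented for illustration.
type pendingJoin struct {
	firstSeen time.Time
	rssi      int
	frequency uint32 // Hz
}

// shouldReplace: a later report of the same join request only displaces the
// one already held if it arrives within the window AND looks stronger, so a
// weak "ghost" copy on an adjacent channel can no longer overwrite the real one.
func shouldReplace(current pendingJoin, newRSSI int, now time.Time, window time.Duration) bool {
	return now.Sub(current.firstSeen) <= window && newRSSI > current.rssi
}
```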

Yeah, I’m aware of all those possible solutions, and I know just removing the antenna is not a good idea in general, but it did the job for a quick test.

The problem is that I do not want to have to account for this end-node-close-to-gateway situation in all my deployments. My point is that it can easily be fixed in the network server if it only replaces an ongoing join request with a new one from the same end node when the new one arrives shortly after (say within a 1 s window) and has a higher RSSI.

I would really like to know if there are any plans to fix this issue.

I do realize it’s a corner case, but the solution is really easy to implement.

I opened a bug report for you at https://github.com/brocaar/chirpstack-network-server/issues/557.
Please add the missing information there.

Thank you man. I commented with the extra info.

This issue has been brought up a couple of times. What happens is that when a device is really close to the gateway, it will over-drive the gateway hardware, causing a “ghost” packet. Thus, in the end, the uplink is reported on two frequencies.

Currently the de-duplication logic does not inspect the LoRaWAN payload. It starts the de-duplication based on the raw payload + frequency, meaning that when the same payload is reported on two frequencies, there are two de-duplication functions running simultaneously and the first one “wins”.
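
Very roughly, and leaving out all the real implementation details, the current behaviour is equivalent to something like this simplified illustration (not the actual code; the key format and names are invented):

```go
package sketch

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"sync"
)

// dedupKey: the de-duplication set is keyed on the raw payload *and* the
// frequency, so the same frame reported on 868.5 and 868.3 lands in two
// independent sets and gets processed twice.
func dedupKey(phyPayload []byte, frequency uint32) string {
	var freq [4]byte
	binary.BigEndian.PutUint32(freq[:], frequency)
	sum := sha256.Sum256(append(append([]byte{}, phyPayload...), freq[:]...))
	return "up:dedup:" + hex.EncodeToString(sum[:])
}

var inFlight sync.Map // key -> struct{}

// startDedup reports whether this is the first sighting of the key; for each
// key the first report "wins" and starts the collect-and-process cycle.
func startDedup(key string) bool {
	_, alreadyRunning := inFlight.LoadOrStore(key, struct{}{})
	return !alreadyRunning
}
```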

I’m not sure if over-driving the gateway radios can cause any permanent damage to the gateway. Maybe somebody else can comment on this. My assumption is that you should try to avoid it.

Also, I’m not sure what the best, secure and still performant solution for this would be, assuming this scenario doesn’t cause any harm to the gateway radios.

The reason why the de-duplication logic includes the frequency in its key is that a security issue was reported a while ago which would allow a replay with a better RSSI / SNR to “steal” the downlink path. One could replay the uplink within the configured de-duplication duration using a different frequency, and with that break the downlink path (e.g. making the LNS respond on a different frequency or at a different time).

I’m open to suggestions.

I wouldn’t expect it to harm the radios even if the device is right next to the gateway (given the power levels used for LoRa), but don’t take my word for it - it certainly is not a correct (or expected) operating condition.

What I do when testing a gateway in the office is simply swap the gateway’s antenna for a 50 ohm RF load. This gives me much more realistic RF numbers and I barely see any ghost packets anymore, while it maintains correct operating conditions with very little effort.

Since both requests carry the same device join nonce (DevNonce), only one can be valid. Accepting a join ultimately needs to be a locked operation; maybe the validity of the nonce needs to be double-checked inside the lock?
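
The shape I have in mind is something like this, with invented names and no opinion on where the nonce state actually lives:

```go
package sketch

import "sync"

// joinGuard serializes join-accept generation per device: whichever copy of a
// join request takes the lock first consumes the DevNonce, and any later copy
// of the same request finds it already used and gets dropped instead of
// producing a second, conflicting join accept.
type joinGuard struct {
	mu   sync.Mutex
	used map[string]map[uint16]bool // DevEUI -> DevNonce -> already accepted
}

func (g *joinGuard) tryConsume(devEUI string, devNonce uint16) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.used == nil {
		g.used = map[string]map[uint16]bool{}
	}
	if g.used[devEUI] == nil {
		g.used[devEUI] = map[uint16]bool{}
	}
	if g.used[devEUI][devNonce] {
		return false // the nonce was already consumed: drop this copy
	}
	g.used[devEUI][devNonce] = true
	return true
}
```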

That still leaves the issue of the “ghost” join request potentially being responded to rather than the real one.

The reason why the de-duplication logic includes the frequency in its key is that a security issue was reported a while ago which would allow a replay with a better RSSI / SNR to “steal” the downlink path. One could replay the uplink within the configured de-duplication duration using a different frequency, and with that break the downlink path (e.g. making the LNS respond on a different frequency or at a different time).

Splitting the replay attack from the real packet doesn’t seem like it can actually protect against it, though, because whichever is processed first should claim the frame count. Or is the LoRaWAN spec’s replay-attack protection of requiring the frame count to increment also being suspended in this case, by letting both run as a race condition?


My feeling is that if protection against this very rare sort of actual replay attack is desired, then it needs to be done by splitting the candidates later, not by weakening the dedupe.

So I’d propose de-duplicating on the raw encrypted uplink packet alone, so that the application and join servers get a single feed and make their decisions only once.

In the case where a downlink is going to be generated, the point of routing is where splitting into multiple packets might just barely be appropriate for this kind of protection.

Reviewing the uplink reports for downlink routing candidates: within any gateway, any report whose timestamp on that gateway is more than a few LoRa symbols later than any other can be discarded, either because it is probably a replay attack, or from the simple reality that trying to respond to it would be a transmit scheduling collision, with at best the later transmission losing and, in some packet forwarders, potentially both losing.

But if a frequency with usable RSSI is reported by a distinct gateway that didn’t report any already-responded frequency, then and only then could the downlink packet be duplicated and assigned to that path as well. (Multiple uplink SFs would need to be considered too, since frequency diversity is not the only form of “stealing” that could be done.) Distinct-gateway downlink stealing would probably require an attacker sophisticated enough to be monitoring near one of your gateways, with a fast path over to a transmitter located near a different gateway, but if you imagine someone wants to attack you badly enough, then that’s within the realm of possibility too.
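
To make the routing-stage idea concrete, here is a rough sketch (all field and function names invented) of the per-gateway filtering described above: within each gateway drop anything stamped more than a small skew after that gateway’s earliest report, then keep one best report per gateway as a downlink-path candidate:

```go
package sketch

import "time"

type rxReport struct {
	gatewayID string
	frequency uint32        // Hz
	rssi      int
	timestamp time.Duration // gateway-local reception time
}

// downlinkCandidates drops, per gateway, any report that is noticeably later
// than that gateway's earliest copy (likely a replay, and un-servable anyway
// because responding to both would collide on the transmitter), then keeps
// the strongest surviving report per gateway as a candidate downlink path.
func downlinkCandidates(reports []rxReport, maxSkew time.Duration) []rxReport {
	earliest := map[string]time.Duration{}
	for _, r := range reports {
		if t, ok := earliest[r.gatewayID]; !ok || r.timestamp < t {
			earliest[r.gatewayID] = r.timestamp
		}
	}
	best := map[string]rxReport{}
	for _, r := range reports {
		if r.timestamp-earliest[r.gatewayID] > maxSkew {
			continue
		}
		if b, ok := best[r.gatewayID]; !ok || r.rssi > b.rssi {
			best[r.gatewayID] = r
		}
	}
	out := make([]rxReport, 0, len(best))
	for _, r := range best {
		out = append(out, r)
	}
	return out
}
```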

Also consider that it’s not just frequency or SF that could be used to steal a downlink: at SFs where a whole packet’s airtime fits in the de-duplication window (which could be pretty large if using a 5 second RX1 delay to allow for slow backhaul), someone could blast the replay back at the gateway with a few watts of power on the very same frequency and SF, and steal the downlink purely on the basis of timing. This demonstrates that neither including the frequency nor the SF in the de-duplication actually offers sufficient protection.
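
For a sense of scale, a back-of-the-envelope time-on-air calculation using the formula from the Semtech SX1272/76 datasheets: a 23-byte join request at SF12/125 kHz spends roughly 1.5 seconds on the air, which fits easily inside a multi-second de-duplication window:

```go
package main

import (
	"fmt"
	"math"
)

// timeOnAir implements the LoRa time-on-air formula from the SX1272/76
// datasheets (explicit header, CRC on). crDenomMinus4 is 1 for coding rate
// 4/5; lowDataRateOpt should be true for SF11/SF12 at 125 kHz.
func timeOnAir(payloadLen, sf int, bwHz float64, crDenomMinus4, preambleLen int, lowDataRateOpt bool) float64 {
	tSym := math.Pow(2, float64(sf)) / bwHz // symbol duration in seconds
	de := 0
	if lowDataRateOpt {
		de = 1
	}
	num := float64(8*payloadLen - 4*sf + 28 + 16)
	den := float64(4 * (sf - 2*de))
	nPayload := 8 + math.Max(math.Ceil(num/den)*float64(crDenomMinus4+4), 0)
	return (float64(preambleLen)+4.25)*tSym + nPayload*tSym
}

func main() {
	// 23-byte join request, SF12, 125 kHz, CR 4/5, 8-symbol preamble, LDRO on.
	fmt.Printf("%.2f s\n", timeOnAir(23, 12, 125000, 1, 8, true)) // ≈ 1.48 s
}
```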

Finally, note that without at least approximately unified packet timing between gateways, it’s not possible to protect against different-gateway (same frequency/SF) timestamp downlink stealing: without some sense of unified timing from GPS or a recent history of mutual packets, it’s not possible to tell which timestamp is first, or even whether attempts to transmit both responses would overlap and jam each other on the air.

In summary, my overall argument is that hedging bets by responding to multiple frequency or SF candidates is something that should be done at the stage of downlink routing, not at uplink de-duplication. No doubt that’s more software work, but it seems like the only place where protection against a rare malicious act can be safely implemented in a way that not only covers all variations of the attack, but doesn’t break the LoRaWAN spec’s implicit protection against much more common, innocently accidental situations such as receiver overload.