Lorawan-stack: Confirm end device session based on network server messages

Created on 8 Jan 2021  ·  8 Comments  ·  Source: TheThingsNetwork/lorawan-stack

Summary

The application server should confirm the end device session (i.e. move dev.PendingSession into dev.Session) on the following events:

  • Uplink message (happens already)
  • Downlink ack
  • Downlink nack
  • Downlink failed (if the error is not unknown session)
  • Downlink queue invalidated
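
The list above can be read as a small decision rule. The following is a minimal sketch of that rule, not the actual Application Server code: `Session`, `Device`, `EventKind`, `shouldConfirm` and `confirm` are hypothetical names introduced for illustration, and the sketch deliberately ignores which session a message belongs to.

```go
package main

import "fmt"

// Session and Device are simplified stand-ins for the AS device state.
// They are assumptions for this sketch, not the real lorawan-stack types.
type Session struct {
	SessionKeyID  []byte
	LastAFCntDown uint32
}

type Device struct {
	Session        *Session
	PendingSession *Session
}

// EventKind enumerates the NS messages listed in the summary (hypothetical names).
type EventKind int

const (
	Uplink EventKind = iota
	DownlinkAck
	DownlinkNack
	DownlinkFailed
	DownlinkQueueInvalidated
)

// shouldConfirm reports whether an NS message of the given kind should
// confirm the pending session. A failed downlink only counts when the
// failure was not caused by an unknown session.
func shouldConfirm(kind EventKind, unknownSession bool) bool {
	switch kind {
	case Uplink, DownlinkAck, DownlinkNack, DownlinkQueueInvalidated:
		return true
	case DownlinkFailed:
		return !unknownSession
	default:
		return false
	}
}

// confirm moves dev.PendingSession into dev.Session and reports whether
// there was a pending session to move.
func confirm(dev *Device) bool {
	if dev.PendingSession == nil {
		return false
	}
	dev.Session, dev.PendingSession = dev.PendingSession, nil
	return true
}

func main() {
	dev := &Device{PendingSession: &Session{SessionKeyID: []byte{0x01}}}
	if shouldConfirm(DownlinkQueueInvalidated, false) {
		fmt.Println("session confirmed:", confirm(dev))
	}
}
```

Keeping the decision (`shouldConfirm`) separate from the state change (`confirm`) makes it easy to call the same logic from every message handler rather than only from the uplink path.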

Why do we need this?

Currently, it is possible that the NS 'switches' the session without the AS knowing this switch occurred. @rvolosatovs reproduced this in v3.11 using the following sequence.

  • End Device joins - AS knows about this session as dev.PendingSession
  • End Device sends FPort=0 uplink - AS won't receive this uplink
  • Network Server schedules an FPort=0 downlink
  • Network Server sends a DownlinkQueueInvalidated event to the AS
  • AS rejects this downlink queue invalidation as it doesn't have a dev.Session

At this point the AS can never schedule downlinks in this session again, unless the NS sends another invalidation at some point in the future. By rejecting the invalidation, the AS effectively rejected the FCnt increase that happened when the NS sent the FPort=0 downlink, so the FCnt it uses from then on is always too low.

What is already there? What do you see now?

The session won't recover unless an invalidation occurs in the future.

What is missing? What do you want to see?

  • Add SessionKeyID to DownlinkQueueInvalidation. If the queue is empty, the AS cannot know at this moment which queue was invalidated.
  • The AS should update the queue + FCnt of the correct session, instead of always assuming that the invalidation is about the dev.Session.
  • The AS should check which session a nacked message is from - it is possible that the nacked message is from a pending session (from the AS perspective) and as such the FCnt update should be done on the correct session.
  • Finally, the AS should switch the session from dev.PendingSession to dev.Session when the messages mentioned in Summary occur.

Environment

v3.11

How do you propose to implement this?

  • Add the required proto field and fill it in on the NS side.
  • Check which session we're using during the invalidation/nack and make sure to update that one in the AS.
  • Move the session switching procedure out of handleUplink and run it for all of the appropriate uplink message types.
  • (Optional) return error details on DownlinkQueue{Push|Replace} with the minimum FCnt and always update the LastAFCntDown to this value. This would ensure that the system converges if at any point we're for some reason out of sync between AS and NS.

How do you propose to test this?

Try to reproduce the sequence mentioned in the reproduction steps.

Can you do this yourself and submit a Pull Request?

Yes, but as this is a non-trivial change I'm tagging this issue as a discussion first: do we want to introduce these changes? The downlink queue invalidation change seems to be a requirement, while the others are good for consistency.

cc @rvolosatovs

Labels: bug, application server, network server, in progress

All 8 comments

I am in favor of this, as we have already discussed.
Application Server should always trust Network Server, because it always has the most up-to-date data about the device session.

  • End Device sends FPort=0 uplink - AS won't receive this uplink

Shouldn't we change this so that NS does send this, but with empty payload and FPort 0, so that it won't be sent upstream?

  • End Device sends FPort=0 uplink - AS won't receive this uplink

Shouldn't we change this so that NS does send this, but with empty payload and FPort 0, so that it won't be sent upstream?

It's redundant. Since NS->AS messaging is async, a queue invalidation can arrive before the FPort==0 uplink that would confirm the session is sent to the AS, so we have to do this anyway. The only reason to send an uplink to the AS in response to an FPort==0 uplink to the NS would be to ensure the AS is notified of the session change as soon as possible, but we don't have such a need. Even then, it would make much more sense to introduce a SessionSwitch message, which the NS would send to the AS instead of the FPort 0 uplink.

I have upgraded the priority to prio/high as it is affecting v3.11.1 deployments.

The changes should be the following:

  • Add SessionKeyID to DownlinkQueueInvalidation. If the queue is empty, the AS cannot know at this moment which queue was invalidated.

The following proto addition (field 3) should suffice.

message ApplicationInvalidatedDownlinks {
  repeated ApplicationDownlink downlinks = 1;
  uint32 last_f_cnt_down = 2;
  bytes session_key_id = 3 [(gogoproto.customname) = "SessionKeyID", (validate.rules).bytes.max_len = 2048];
}

  • The AS should update the queue + FCnt of the correct session, instead of always assuming that the invalidation is about the dev.Session.

https://github.com/TheThingsNetwork/lorawan-stack/blob/e2fa6c085eaaf1a0b70939020244875bd01e5857/pkg/applicationserver/applicationserver.go#L1020-L1038

Instead of using the dev.Session always, do a switch on SessionKeyID in order to establish which session to use. If required, update the current dev.Session.

  • The AS should check which session a nacked message is from - it is possible that the nacked message is from a pending session (from the AS perspective) and as such the FCnt update should be done on the correct session.

https://github.com/TheThingsNetwork/lorawan-stack/blob/e2fa6c085eaaf1a0b70939020244875bd01e5857/pkg/applicationserver/applicationserver.go#L1060-L1073

As before, do a switch on the SessionKeyID in order to determine which session to use. If required, update the current dev.Session.
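
The "switch on SessionKeyID" described for both the invalidation and the nack case could look roughly like the sketch below. It is only an illustration under stated assumptions: `Session`, `Device`, `sessionFor`, `handleInvalidation` and `errUnknownSession` are hypothetical stand-ins rather than the code behind the links above, and copying the frame counter stands in for the real queue recalculation.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Session and Device are simplified stand-ins for the AS device state.
// They are assumptions for this sketch, not the real lorawan-stack types.
type Session struct {
	SessionKeyID  []byte
	LastAFCntDown uint32
}

type Device struct {
	Session        *Session
	PendingSession *Session
}

// errUnknownSession is a hypothetical error for messages that match
// neither the current nor the pending session.
var errUnknownSession = errors.New("unknown session")

// sessionFor returns the session identified by sessionKeyID. When the ID
// belongs to the pending session, that session is promoted to dev.Session
// first, because the NS has evidently switched to it already.
func sessionFor(dev *Device, sessionKeyID []byte) (*Session, error) {
	switch {
	case dev.Session != nil && bytes.Equal(dev.Session.SessionKeyID, sessionKeyID):
		return dev.Session, nil
	case dev.PendingSession != nil && bytes.Equal(dev.PendingSession.SessionKeyID, sessionKeyID):
		dev.Session, dev.PendingSession = dev.PendingSession, nil
		return dev.Session, nil
	default:
		return nil, errUnknownSession
	}
}

// handleInvalidation records the frame counter reported by the NS on the
// session the invalidation refers to. A nack handler could reuse
// sessionFor in the same way.
func handleInvalidation(dev *Device, sessionKeyID []byte, lastFCntDown uint32) error {
	s, err := sessionFor(dev, sessionKeyID)
	if err != nil {
		return err
	}
	s.LastAFCntDown = lastFCntDown
	return nil
}

func main() {
	// The AS only knows the session as pending, as in the reproduction steps.
	dev := &Device{PendingSession: &Session{SessionKeyID: []byte("pending")}}
	err := handleInvalidation(dev, []byte("pending"), 1)
	fmt.Println(err, dev.Session.LastAFCntDown)
}
```

With this shape, an invalidation for a session that the AS still considers pending no longer gets rejected: it both confirms the session and carries the frame counter forward.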

  • (Optional) return error details on DownlinkQueue{Push|Replace} with the minimum FCnt and always update the LastAFCntDown to this value. This would ensure that the system converges if at any point we're for some reason out of sync between AS and NS.

The following proto addition should be filled and be added as error details to the errFCntTooLow in the NS:

message UpdateDownlinkQueueErrorDetails {
  bytes session_key_id = 1 [(gogoproto.customname) = "SessionKeyID", (validate.rules).bytes.max_len = 2048];
  uint32 last_f_cnt_down = 2;
}

The AS can then take these details and update the current session.

The changes are backwards compatible and hopefully minimal on the NS side. The genie is already out of the bottle - the whole protocol between AS and NS slowly became asynchronous, and simply reverting the FPort=0 change won't be enough. I don't think that this transformation was wrong at the end of the day, but we must fix these quirks regarding session bisimulation.

cc @johanstokking, @rvolosatovs

Sounds good to me!

Anything I can do here? If so, please re-assign me and let me know what.

Anything I can do here? If so, please re-assign me and let me know what.

I'm already working on this, with a huge emphasis on the following point:

  • (Optional) return error details on DownlinkQueue{Push|Replace} with the minimum FCnt and always update the LastAFCntDown to this value. This would ensure that the system converges if at any point we're for some reason out of sync between AS and NS.

The reason for this is that acting on the events received from the NS through the queue is fundamentally not enough:

  • Messages may be old, but we want to do fixes to the queue _now_
  • Messages may be lost - the queue is not infinite
  • Messages may be reordered - they are pushed into the queue asynchronously, and they may be popped in a different order

Given these characteristics, there are two options:

  • Delay the queue computation to the end of the batch processing (basically, if you receive 10 downlink queue invalidations, you recalculate the queue only at the end). This is not enough for the following reasons:

    • When the AS processes a batch of messages, it still doesn't know whether it is _really_ at the end of the queue or processing some batch in the middle

    • Messages may still be lost, and determining their order is not trivial.

  • Just _trust the NS_. By this I mean that push/replace operations return the dev.Session.SessionKeyID, the dev.PendingSession.SessionKeyID, and the LastAFCntDown as part of the error details. We then use the error details to rebuild the session in the AS. Fundamentally, this means that when we attempt a downlink queue operation using outdated data (perhaps an outdated session, perhaps an FCnt that is too low), we eventually converge to the NS state. It may take one, two, or three tries, and I'll bound the number of attempts to avoid spinning forever, but we're at least operating with information that's significantly newer than what the uplink message queue provides.

I think this is the best option.
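
To make the "trust the NS" option concrete, here is a toy model of the bounded retry. Everything in it is an assumption made for illustration: `nsState`, `asState`, `queueError`, `pushWithResync` and `maxAttempts` are hypothetical, the NS acceptance rule is a simplification, and the error carries a single session key ID instead of both the current and the pending one.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// nsState is a toy Network Server view of one device (hypothetical).
type nsState struct {
	sessionKeyID  []byte
	lastAFCntDown uint32
}

// queueError mimics the proposed error details of a rejected queue operation.
type queueError struct {
	SessionKeyID  []byte
	LastAFCntDown uint32
}

func (e *queueError) Error() string { return "downlink queue operation rejected" }

// push accepts a downlink only for the NS's current session and only with
// a frame counter above the last one it has seen (toy rule).
func (ns *nsState) push(sessionKeyID []byte, fCnt uint32) error {
	if !bytes.Equal(sessionKeyID, ns.sessionKeyID) || fCnt <= ns.lastAFCntDown {
		return &queueError{SessionKeyID: ns.sessionKeyID, LastAFCntDown: ns.lastAFCntDown}
	}
	ns.lastAFCntDown = fCnt
	return nil
}

// asState is the Application Server's possibly outdated view.
type asState struct {
	sessionKeyID  []byte
	lastAFCntDown uint32
}

// maxAttempts bounds the retries so the AS never spins forever.
const maxAttempts = 3

// pushWithResync tries to push one downlink. On rejection it rebuilds the
// AS view from the error details and retries, up to maxAttempts times.
// It returns the number of attempts made and the last error, if any.
func pushWithResync(as *asState, ns *nsState) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = ns.push(as.sessionKeyID, as.lastAFCntDown+1)
		if err == nil {
			as.lastAFCntDown++
			return attempt, nil
		}
		var qe *queueError
		if !errors.As(err, &qe) {
			return attempt, err // Not a resync hint, so give up.
		}
		as.sessionKeyID = qe.SessionKeyID
		as.lastAFCntDown = qe.LastAFCntDown
	}
	return maxAttempts, err
}

func main() {
	// The AS still believes in an old session with a low frame counter.
	as := &asState{sessionKeyID: []byte("old"), lastAFCntDown: 0}
	ns := &nsState{sessionKeyID: []byte("new"), lastAFCntDown: 5}
	attempts, err := pushWithResync(as, ns)
	fmt.Println(attempts, err, as.lastAFCntDown)
}
```

In this model the first attempt fails with the stale session, the AS adopts the state from the error details, and the second attempt succeeds, which is the convergence behaviour the comment is after.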
