lorawan-stack: Class C messages intermittently fail to schedule with invalid absolute time

Created on 17 Nov 2020  ·  15 comments  ·  Source: TheThingsNetwork/lorawan-stack

Summary

Class C messages intermittently fail to schedule with "invalid absolute time set in application downlink" when closely spaced.

Is there a minimum time between messages to the same device? Ideally we would schedule them 130 ms apart, but we have tried increasing the spacing to 1000 ms to mitigate the issue.

Steps to Reproduce

  1. Schedule a downlink at current time + 7 sec to gateway X
  2. Schedule a downlink at current time + 8 sec to gateway X
  3. Wait for 10 seconds and observe DownlinkFailed on the MQTT topic
  4. If required, run in a loop to increase the chance of reproducing the error

We see this approximately every 15 downlinks. A sketch of the publish calls in steps 1 and 2 follows below.
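For reference, a minimal sketch of the publish calls in steps 1 and 2, using the Eclipse Paho MQTT client for Go. The broker address, username, and API key are placeholders; the topic, payload, and gateway IDs mirror the logs under "What do you see now?":

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Broker address, username, and API key are placeholders.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://thethings.example.com:1883").
		SetUsername("halter").
		SetPassword("NNSXS.REDACTED")
	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}

	// Push one class C downlink with an absolute time to one gateway.
	push := func(gatewayID string, at time.Time) {
		msg := map[string]interface{}{
			"downlinks": []map[string]interface{}{{
				"f_port":      1,
				"frm_payload": "ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==",
				"priority":    "ABOVE_NORMAL",
				"class_b_c": map[string]interface{}{
					"absolute_time": at.UTC().Format(time.RFC3339),
					"gateways": []map[string]interface{}{
						{"gateway_ids": map[string]string{"gateway_id": gatewayID}},
					},
				},
			}},
		}
		payload, _ := json.Marshal(msg)
		topic := "v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push"
		if token := client.Publish(topic, 0, false, payload); token.Wait() && token.Error() != nil {
			fmt.Println("publish failed:", token.Error())
		}
	}

	now := time.Now()
	push("eui-00800000a0003582", now.Add(7*time.Second)) // step 1
	push("eui-00800000a0005314", now.Add(8*time.Second)) // step 2
}
```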

What do you see now?

17/11/2020 14:09:30.783 +1300   Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:37.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0003582"}}]}}]}
17/11/2020 14:09:30.883 +1300   Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:38.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}]}}]}
17/11/2020 14:09:37.068 +1300   Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}
17/11/2020 14:09:37.068 +1300   Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}

What do you want to see instead?

No downlink failures; the messages should be transmitted by the gateway.

Environment

The Things Stack for LoRaWAN: ttn-lw-stack
Version: 3.9.4
Build date: 2020-09-23T09:56:19Z
Git commit: c4be55c
Go version: go1.15.2
OS/Arch: linux/amd64

How do you propose to implement this?

Determine the cause of the invalid absolute time error and resolve it if there is a bug.

How do you propose to test this?

Happy to test a PR in our dev environment

Can you do this yourself and submit a Pull Request?

@rvolosatovs ?

Labels: bug, gateway server

All 15 comments

@virtualguy is this issue persisting?

What is the data rate and in which region are you?

When you say you want to transmit 130 ms apart, are you taking the time-on-air into account?

Can you subscribe to the gateway and end device events, via $ ttn-lw-cli events ..., and paste the exact error messages that Gateway Server reports on why the scheduling fails?
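For context on the time-on-air question: below is a minimal Go sketch of the LoRa air-time formula from the Semtech SX127x datasheet. The ~30-byte PHY payload and 8-symbol preamble are illustrative; at SF12/125 kHz a frame of that size is on air for well over a second, so 130 ms spacing cannot work at slow data rates:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// loraTimeOnAir computes LoRa air time using the formula from the Semtech
// SX127x datasheet, with CRC on and explicit header as LoRaWAN uses.
// payloadLen is the PHY payload in bytes, bw is in Hz, cr 1..4 (4/5..4/8).
func loraTimeOnAir(payloadLen, sf int, bw float64, cr, preambleLen int) time.Duration {
	tSym := math.Pow(2, float64(sf)) / bw // symbol duration in seconds

	de := 0 // low data rate optimization, mandatory at SF11/SF12 on 125 kHz
	if bw == 125000 && sf >= 11 {
		de = 1
	}
	num := float64(8*payloadLen - 4*sf + 28 + 16) // +16 for the payload CRC
	nPayload := 8 + math.Max(math.Ceil(num/float64(4*(sf-2*de)))*float64(cr+4), 0)

	seconds := (float64(preambleLen) + 4.25 + nPayload) * tSym
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	// Illustrative 30-byte PHY payload, 8-symbol preamble, CR 4/5:
	fmt.Println("SF7 :", loraTimeOnAir(30, 7, 125000, 1, 8))  // tens of ms
	fmt.Println("SF12:", loraTimeOnAir(30, 12, 125000, 1, 8)) // over a second
}
```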

@virtualguy I added some extra debug lines. Pick any of:

  1. Apply patch investigate-3487.txt
  2. Cherry-pick https://github.com/TheThingsNetwork/lorawan-stack/commit/f9bbda5db5090dd7eb7bc2d798c92bed82efc2dd
  3. Check-out https://github.com/TheThingsNetwork/lorawan-stack/tree/investigate/3487-abs-downlink-timing (based on v3.10.7)

Please grep the output for #3487 and copy it here. It's also the only output that goes to stdout (as logs go to stderr).

If it says in ScheduleAt that there are too few RTTs, you might want to increase TTN_LW_EXP_RTT_TTL (duration, like 6h for 6 hours, default is 30m for 30 minutes) and/or decrease TTN_LW_EXP_SCHEDULE_MIN_RTT_COUNT to 3 or something (default is 5). These are temporary feature flags.
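For illustration, a hypothetical sketch of how these two flags could interact: old RTT samples expire after the TTL, and absolute-time scheduling requires the minimum count of fresh samples. The envOr helper and the overall shape are assumptions, not the stack's actual code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func main() {
	// The duration values use Go's time.ParseDuration format ("6h", "30m").
	ttl, _ := time.ParseDuration(envOr("TTN_LW_EXP_RTT_TTL", "30m"))
	minCount, _ := strconv.Atoi(envOr("TTN_LW_EXP_SCHEDULE_MIN_RTT_COUNT", "5"))

	var recordedAt []time.Time // times at which TX acknowledgments arrived
	fresh := 0
	for _, t := range recordedAt {
		if time.Since(t) <= ttl { // samples older than the TTL are discarded
			fresh++
		}
	}
	if fresh < minCount {
		fmt.Printf("too few RTTs (%d < %d): cannot compensate absolute time\n", fresh, minCount)
	}
}

func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}
```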

cc @ymgupta

FYI, we are having some trouble running binaries we have built, so we are blocked on #3736 before we can gather logs from your patch @johanstokking

@virtualguy @johanstokking Here are some logs from running the patched version: https://gist.github.com/kurtmc/75f4ecf93c2f7a1ee8373d3a9c7f181a

Thanks a lot. Apologies for the delayed response.

This is going to be super helpful. For finding the root cause, I added a few more log entries.

Can you run this again, with a new build, using https://github.com/TheThingsNetwork/lorawan-stack/tree/investigate/3487-abs-downlink-timing?

If you want to cherry-pick on v3.10.x, cherry-pick 50f56055ba2c0172c002784d0ef22f140b60903c and d1e5305d2363aa8ce8dc485b36c9977857da6b66

Also for the output, I need the full trace, from the beginning.

Note that we're now printing the gateway ID (or EUI). If that is sensitive information, please redact that to another meaningful value or send the log via email.

@kurtmc @virtualguy let me know how we can help with setting this up. If I need to send you a binary or Docker image, just let me know.

Thanks @kurtmc. I don't see the "no absolute time" errors appearing here, only in the beginning but that is normal. So here, everything worked as expected, right?

@adriansmares the race whose synchronization is fixed by https://github.com/TheThingsNetwork/lorawan-stack/pull/3794 is actually happening here. So that is real, see:

"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Record: 30.629582ms at 2021-02-14 21:58:36.014795727 +0000 UTC m=+86.734546258"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"

Note that these statements are out-of-order because of underlying flushing; it looks like short statements are flushed immediately and longer statements are delayed.


I think the issue is that when (*rtts).Record() releases the write lock, both concurrent (*rtts).Stats() calls acquire a read lock, so the two concurrent (*Scheduler).ScheduleAt() calls end up exactly in sync. As those also only acquire (another) read lock, they proceed concurrently, leading to corruption of Scheduler state.

This is fixed with https://github.com/TheThingsNetwork/lorawan-stack/pull/3794.
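For illustration, a simplified Go model of the race and the fix described above. The type and field names are illustrative, not the actual pkg/gatewayserver/scheduling code; the point is that a read lock lets both ScheduleAt calls mutate shared scheduler state together, while an exclusive lock serializes them:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// rtts guards its samples with an RWMutex; Median takes only a read lock,
// so two callers that wake up right after Record releases the write lock
// proceed in lockstep.
type rtts struct {
	mu      sync.RWMutex
	samples []time.Duration
}

func (r *rtts) Median() time.Duration {
	r.mu.RLock() // shared: concurrent callers all enter together
	defer r.mu.RUnlock()
	sorted := append([]time.Duration(nil), r.samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[len(sorted)/2]
}

type scheduler struct {
	mu   sync.Mutex // the fix: serialize the whole scheduling decision
	rtts *rtts
	next time.Time // shared state: end of the last scheduled emission
}

func (s *scheduler) ScheduleAt(at time.Time, airtime time.Duration) error {
	s.mu.Lock() // with only a read lock here, two goroutines race on s.next
	defer s.mu.Unlock()
	if at.Before(s.next) {
		return fmt.Errorf("conflicts with emission ending at %v", s.next)
	}
	_ = s.rtts.Median() // RTT compensation under the same exclusive lock
	s.next = at.Add(airtime)
	return nil
}

func main() {
	s := &scheduler{rtts: &rtts{samples: []time.Duration{30 * time.Millisecond}}}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		i, at := i, time.Now().Add(time.Duration(7+i)*time.Second)
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("downlink", i, "->", s.ScheduleAt(at, 100*time.Millisecond))
		}()
	}
	wg.Wait()
}
```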

@kurtmc is there a chance we can get DEBUG level logs from The Things Stack as well? Set log.level: debug in the config YAML.

@virtualguy @kurtmc can you verify that this is resolved on latest v3.11?

Note the minor bump, you need to run DB migrations, see https://github.com/TheThingsNetwork/lorawan-stack/blob/e56f7f70e60dba8c1ad584411fb63a8c35659e7c/CHANGELOG.md#3110---2021-02-10

We'll be rolling a 3.11.1 release today.

Closed by #3794

@virtualguy @kurtmc FYI we're backporting this to 3.10.10 so we can update our infrastructure sooner. Please keep an eye on #3800 and/or subscribe to release notifications here.

@johanstokking Just to let you know, we have upgraded our production environment to 3.10.10 and we are still seeing the absolute time error in the logs.

@kurtmc it is expected in the following case:

  1. The gateway does not provide GPS timestamps, so there is no absolute time known at the gateway and it must be calculated by the Gateway Server
  2. The gateway has not transmitted more than 5 downlink messages
  3. The gateway has not confirmed more than 5 downlink messages (via TX acknowledgment)

We use the latency between scheduling the downlink message and receiving the TX acknowledgment as the round-trip time. We need at least 5 of them to reliably take the median value. Then, when scheduling a class C downlink message with absolute time, Gateway Server uses the server time and the median round-trip time to calculate the absolute (server) time and the corresponding concentrator timestamp.
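For illustration, a hypothetical sketch of that calculation: take the median of at least five RTT samples, treat half of it as the one-way server-to-gateway delay, and project the absolute time onto the gateway's 32-bit microsecond concentrator clock relative to a known sync point. The function signature and the sync-point parameters are assumptions, not the actual Gateway Server code:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// concentratorTimestamp translates an absolute server time into a 32-bit
// microsecond concentrator timestamp, given a sync point where the server
// time syncServerTime corresponds to concentrator timestamp syncTimestamp.
// All names are hypothetical; the real logic lives elsewhere in the stack.
func concentratorTimestamp(rtts []time.Duration, syncServerTime time.Time, syncTimestamp uint32, target time.Time) (uint32, error) {
	if len(rtts) < 5 {
		return 0, fmt.Errorf("too few RTTs (%d) for a reliable median", len(rtts))
	}
	sorted := append([]time.Duration(nil), rtts...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	median := sorted[len(sorted)/2]

	// Half the median RTT approximates the one-way delay; the uint32
	// conversion wraps naturally, like the concentrator clock itself.
	elapsed := target.Sub(syncServerTime) - median/2
	return syncTimestamp + uint32(elapsed.Microseconds()), nil
}

func main() {
	samples := []time.Duration{
		31 * time.Millisecond, 30 * time.Millisecond, 32 * time.Millisecond,
		29 * time.Millisecond, 33 * time.Millisecond, // illustrative RTTs
	}
	syncAt := time.Now() // moment the sync point was observed
	ts, err := concentratorTimestamp(samples, syncAt, 1000000, syncAt.Add(7*time.Second))
	fmt.Println(ts, err)
}
```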


If you see absolute time errors still, outside the cases above, please provide DEBUG level server logs.

Definitely still seeing this issue in 3.10.10; it looks like it happens on back-to-back transmissions (1000 ms apart to the same device ID). I have sent through DEBUG logs via TTI support.
