lorawan-stack: Class C messages intermittently fail to schedule with invalid absolute time

Created on 17 Nov 2020  ·  15 comments  ·  Source: TheThingsNetwork/lorawan-stack

Summary

Class C messages intermittently fail to schedule with "invalid absolute time set in application downlink" when closely spaced.

Is there a minimum time between messages to the same device? Ideally we would schedule them 130 ms apart, but we have tried increasing the spacing to 1000 ms to mitigate the issue.

Steps to Reproduce

  1. Schedule a downlink at current time + 7 sec to gateway X
  2. Schedule a downlink at current time + 8 sec to gateway X
  3. Wait for 10 seconds and observe DownlinkFailed on the MQTT topic
  4. If required, run in a loop to increase the chance of reproducing the error

We see this approximately every 15 downlinks. A sketch of the publish calls in steps 1 and 2 follows below.
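For reference, a minimal sketch of the publish calls in steps 1 and 2, using the Eclipse Paho MQTT client for Go. The broker address, username, and API key are placeholders; the topic, payload, and gateway IDs mirror the logs under "What do you see now?":

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Broker address, username, and API key are placeholders.
	opts := mqtt.NewClientOptions().
		AddBroker("tcp://thethings.example.com:1883").
		SetUsername("halter").
		SetPassword("NNSXS.REDACTED")
	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		panic(token.Error())
	}

	// Push one class C downlink with an absolute time to one gateway.
	push := func(gatewayID string, at time.Time) {
		msg := map[string]interface{}{
			"downlinks": []map[string]interface{}{{
				"f_port":      1,
				"frm_payload": "ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==",
				"priority":    "ABOVE_NORMAL",
				"class_b_c": map[string]interface{}{
					"absolute_time": at.UTC().Format(time.RFC3339),
					"gateways": []map[string]interface{}{
						{"gateway_ids": map[string]string{"gateway_id": gatewayID}},
					},
				},
			}},
		}
		payload, _ := json.Marshal(msg)
		topic := "v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push"
		if token := client.Publish(topic, 0, false, payload); token.Wait() && token.Error() != nil {
			fmt.Println("publish failed:", token.Error())
		}
	}

	now := time.Now()
	push("eui-00800000a0003582", now.Add(7*time.Second)) // step 1
	push("eui-00800000a0005314", now.Add(8*time.Second)) // step 2
}
```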

What do you see now?

17/11/2020 14:09:30.783 +1300   Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:37.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0003582"}}]}}]}
17/11/2020 14:09:30.883 +1300   Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:38.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}]}}]}
17/11/2020 14:09:37.068 +1300   Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}
17/11/2020 14:09:37.068 +1300   Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}

What do you want to see instead?

No downlink failures; the messages should be transmitted by the gateway.

Environment

The Things Stack for LoRaWAN: ttn-lw-stack
Version: 3.9.4
Build date: 2020-09-23T09:56:19Z
Git commit: c4be55c
Go version: go1.15.2
OS/Arch: linux/amd64

How do you propose to implement this?

Determine the cause of the invalid absolute time error and resolve it if there is a bug.

How do you propose to test this?

Happy to test a PR in our dev environment

Can you do this yourself and submit a Pull Request?

@rvolosatovs ?

Labels: bug, gateway server

All 15 comments

@virtualguy is this issue persisting?

What is the data rate and in which region are you?

When you say you want to transmit 130 ms apart, are you taking the time-on-air into account?

Can you subscribe to the gateway and end device events, via $ ttn-lw-cli events ..., and paste the exact error messages that Gateway Server reports on why the scheduling fails?
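For context on the time-on-air question: below is a minimal Go sketch of the LoRa air-time formula from the Semtech SX127x datasheet. The ~30-byte PHY payload and 8-symbol preamble are illustrative; at SF12/125 kHz a frame of that size is on air for well over a second, so 130 ms spacing cannot work at slow data rates:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// loraTimeOnAir computes LoRa air time using the formula from the Semtech
// SX127x datasheet, with CRC on and explicit header as LoRaWAN uses.
// payloadLen is the PHY payload in bytes, bw is in Hz, cr 1..4 (4/5..4/8).
func loraTimeOnAir(payloadLen, sf int, bw float64, cr, preambleLen int) time.Duration {
	tSym := math.Pow(2, float64(sf)) / bw // symbol duration in seconds

	de := 0 // low data rate optimization, mandatory at SF11/SF12 on 125 kHz
	if bw == 125000 && sf >= 11 {
		de = 1
	}
	num := float64(8*payloadLen - 4*sf + 28 + 16) // +16 for the payload CRC
	nPayload := 8 + math.Max(math.Ceil(num/float64(4*(sf-2*de)))*float64(cr+4), 0)

	seconds := (float64(preambleLen) + 4.25 + nPayload) * tSym
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	// Illustrative 30-byte PHY payload, 8-symbol preamble, CR 4/5:
	fmt.Println("SF7 :", loraTimeOnAir(30, 7, 125000, 1, 8))  // tens of ms
	fmt.Println("SF12:", loraTimeOnAir(30, 12, 125000, 1, 8)) // over a second
}
```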

@virtualguy I added some extra debug lines. Pick any of:

  1. Apply patch investigate-3487.txt
  2. Cherry-pick https://github.com/TheThingsNetwork/lorawan-stack/commit/f9bbda5db5090dd7eb7bc2d798c92bed82efc2dd
  3. Check-out https://github.com/TheThingsNetwork/lorawan-stack/tree/investigate/3487-abs-downlink-timing (based on v3.10.7)

Please grep the output for #3487 and copy it here. It's also the only output that goes to stdout (as logs go to stderr).

If it says in ScheduleAt that there are too few RTTs, you might want to increase TTN_LW_EXP_RTT_TTL (duration, like 6h for 6 hours, default is 30m for 30 minutes) and/or decrease TTN_LW_EXP_SCHEDULE_MIN_RTT_COUNT to 3 or something (default is 5). These are temporary feature flags.
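For illustration, a hypothetical sketch of how these two flags could interact: old RTT samples expire after the TTL, and absolute-time scheduling requires the minimum count of fresh samples. The envOr helper and the overall shape are assumptions, not the stack's actual code:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

func main() {
	// The duration values use Go's time.ParseDuration format ("6h", "30m").
	ttl, _ := time.ParseDuration(envOr("TTN_LW_EXP_RTT_TTL", "30m"))
	minCount, _ := strconv.Atoi(envOr("TTN_LW_EXP_SCHEDULE_MIN_RTT_COUNT", "5"))

	var recordedAt []time.Time // times at which TX acknowledgments arrived
	fresh := 0
	for _, t := range recordedAt {
		if time.Since(t) <= ttl { // samples older than the TTL are discarded
			fresh++
		}
	}
	if fresh < minCount {
		fmt.Printf("too few RTTs (%d < %d): cannot compensate absolute time\n", fresh, minCount)
	}
}

func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}
```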

cc @ymgupta

FYI, we are having some trouble running binaries we have built, so we are blocked on #3736 before we can gather logs from your patch @johanstokking

@virtualguy @johanstokking Here are some logs from running the patched version: https://gist.github.com/kurtmc/75f4ecf93c2f7a1ee8373d3a9c7f181a

Thanks a lot. Apologies for the delayed response.

This is going to be super helpful. For finding the root cause, I added a few more log entries.

Can you run this again, with a new build, using https://github.com/TheThingsNetwork/lorawan-stack/tree/investigate/3487-abs-downlink-timing?

If you want to cherry-pick on v3.10.x, cherry-pick 50f56055ba2c0172c002784d0ef22f140b60903c and d1e5305d2363aa8ce8dc485b36c9977857da6b66

Also for the output, I need the full trace, from the beginning.

Note that we're now printing the gateway ID (or EUI). If that is sensitive information, please redact that to another meaningful value or send the log via email.

@kurtmc @virtualguy let me know how we can help with setting this up. If I need to send you a binary or Docker image, just let me know.

Thanks @kurtmc. I don't see the "no absolute time" errors appearing here, only in the beginning but that is normal. So here, everything worked as expected, right?

@adriansmares the race whose synchronization is fixed by https://github.com/TheThingsNetwork/lorawan-stack/pull/3794 is actually happening here. So that is real, see:

"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Record: 30.629582ms at 2021-02-14 21:58:36.014795727 +0000 UTC m=+86.734546258"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1  | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"

Note that these statements are out-of-order because of underlying flushing; it looks like short statements are flushed immediately and longer statements are delayed.


I think the issue is that when (*rtts).Record() releases the write lock, both concurrent (*rtts).Stats() calls acquire a read lock, so the two concurrent (*Scheduler).ScheduleAt() calls end up exactly in sync. As those also only acquire (another) read lock, they proceed concurrently, leading to corruption of Scheduler state.

This is fixed with https://github.com/TheThingsNetwork/lorawan-stack/pull/3794.
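For illustration, a simplified Go model of the race and the fix described above. The type and field names are illustrative, not the actual pkg/gatewayserver/scheduling code; the point is that a read lock lets both ScheduleAt calls mutate shared scheduler state together, while an exclusive lock serializes them:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// rtts guards its samples with an RWMutex; Median takes only a read lock,
// so two callers that wake up right after Record releases the write lock
// proceed in lockstep.
type rtts struct {
	mu      sync.RWMutex
	samples []time.Duration
}

func (r *rtts) Median() time.Duration {
	r.mu.RLock() // shared: concurrent callers all enter together
	defer r.mu.RUnlock()
	sorted := append([]time.Duration(nil), r.samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[len(sorted)/2]
}

type scheduler struct {
	mu   sync.Mutex // the fix: serialize the whole scheduling decision
	rtts *rtts
	next time.Time // shared state: end of the last scheduled emission
}

func (s *scheduler) ScheduleAt(at time.Time, airtime time.Duration) error {
	s.mu.Lock() // with only a read lock here, two goroutines race on s.next
	defer s.mu.Unlock()
	if at.Before(s.next) {
		return fmt.Errorf("conflicts with emission ending at %v", s.next)
	}
	_ = s.rtts.Median() // RTT compensation under the same exclusive lock
	s.next = at.Add(airtime)
	return nil
}

func main() {
	s := &scheduler{rtts: &rtts{samples: []time.Duration{30 * time.Millisecond}}}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		i, at := i, time.Now().Add(time.Duration(7+i)*time.Second)
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("downlink", i, "->", s.ScheduleAt(at, 100*time.Millisecond))
		}()
	}
	wg.Wait()
}
```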

@kurtmc is there a chance we can get DEBUG level logs from The Things Stack as well? Set log.level: debug in the config YAML.

@virtualguy @kurtmc can you verify that this is resolved on latest v3.11?

Note the minor bump, you need to run DB migrations, see https://github.com/TheThingsNetwork/lorawan-stack/blob/e56f7f70e60dba8c1ad584411fb63a8c35659e7c/CHANGELOG.md#3110---2021-02-10

We'll be rolling a 3.11.1 release today.

Closed by #3794

@virtualguy @kurtmc FYI we're backporting this to 3.10.10 so we can update our infrastructure sooner. Please keep an eye on #3800 and/or subscribe to release notifications here.

@johanstokking Just to let you know, we have upgraded our production environment to 3.10.10 and we are still seeing the absolute time error in the logs.

@kurtmc it is expected in the following case:

  1. The gateway does not provide GPS timestamps, so there is no absolute time known at the gateway and it must be calculated by the Gateway Server
  2. The gateway has not transmitted more than 5 downlink messages
  3. The gateway has not confirmed more than 5 downlink messages (via TX acknowledgment)

We use the latency between scheduling the downlink message and receiving the TX acknowledgment as the round-trip time. We need at least 5 of them to reliably take the median value. Then, when scheduling a class C downlink message with absolute time, Gateway Server uses the server time and the median round-trip time to calculate the absolute (server) time and the corresponding concentrator timestamp.
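For illustration, a hypothetical sketch of that calculation: take the median of at least five RTT samples, treat half of it as the one-way server-to-gateway delay, and project the absolute time onto the gateway's 32-bit microsecond concentrator clock relative to a known sync point. The function signature and the sync-point parameters are assumptions, not the actual Gateway Server code:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// concentratorTimestamp translates an absolute server time into a 32-bit
// microsecond concentrator timestamp, given a sync point where the server
// time syncServerTime corresponds to concentrator timestamp syncTimestamp.
// All names are hypothetical; the real logic lives elsewhere in the stack.
func concentratorTimestamp(rtts []time.Duration, syncServerTime time.Time, syncTimestamp uint32, target time.Time) (uint32, error) {
	if len(rtts) < 5 {
		return 0, fmt.Errorf("too few RTTs (%d) for a reliable median", len(rtts))
	}
	sorted := append([]time.Duration(nil), rtts...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	median := sorted[len(sorted)/2]

	// Half the median RTT approximates the one-way delay; the uint32
	// conversion wraps naturally, like the concentrator clock itself.
	elapsed := target.Sub(syncServerTime) - median/2
	return syncTimestamp + uint32(elapsed.Microseconds()), nil
}

func main() {
	samples := []time.Duration{
		31 * time.Millisecond, 30 * time.Millisecond, 32 * time.Millisecond,
		29 * time.Millisecond, 33 * time.Millisecond, // illustrative RTTs
	}
	syncAt := time.Now() // moment the sync point was observed
	ts, err := concentratorTimestamp(samples, syncAt, 1000000, syncAt.Add(7*time.Second))
	fmt.Println(ts, err)
}
```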


If you see absolute time errors still, outside the cases above, please provide DEBUG level server logs.

Definitely still seeing this issue in 3.10.10; it looks like it happens on back-to-back transmissions (1000 ms apart to the same device ID). I have sent through DEBUG logs via TTI support.
