Class C messages intermittently fail to schedule with "invalid absolute time set in application downlink" when closely spaced.
Is there a minimum time between messages to the same device? Ideally we would have them scheduled 130ms apart, but we have tried increasing the spacing to 1000ms to mitigate the issue.
We see this approximately every 15 downlinks.
17/11/2020 14:09:30.783 +1300 Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:37.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0003582"}}]}}]}
17/11/2020 14:09:30.883 +1300 Sending TTN V3 multicast. topic: v3/halter/devices/2704aee7-7c77-4c99-8d26-cf110b1a90a7/down/push. Payload: {"downlinks":[{"f_port":1,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","priority":"ABOVE_NORMAL","class_b_c":{"absolute_time":"2020-11-17T01:09:38.000Z","gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}]}}]}
17/11/2020 14:09:37.068 +1300 Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}
17/11/2020 14:09:37.068 +1300 Serial 2704aee7-7c77-4c99-8d26-cf110b1a90a7: Receiving DownlinkFailed. Payload: {"end_device_ids":{"device_id":"2704aee7-7c77-4c99-8d26-cf110b1a90a7","application_ids":{"application_id":"halter"},"dev_addr":"01415A3E"},"correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA","as:up:01EQ9W005A64G2W1Y3GQA9S3SR"],"received_at":"2020-11-17T01:09:37.066774351Z","downlink_failed":{"downlink":{"f_port":1,"f_cnt":56,"frm_payload":"ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==","class_b_c":{"gateways":[{"gateway_ids":{"gateway_id":"eui-00800000a0005314"}}],"absolute_time":"2020-11-17T01:09:38Z"},"priority":"ABOVE_NORMAL","correlation_ids":["as:downlink:01EQ9VZT44AARKQ3V190WBSXZA"]},"error":{"namespace":"pkg/networkserver","name":"absolute_time","message_format":"invalid absolute time set in application downlink","code":3}}}
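For reference, here is a minimal Go sketch of how push payloads like the ones in the logs above could be built with a fixed spacing between absolute times. The JSON field names are copied from the logged payloads; the Go type and function names (`pushRequest`, `buildPushes`, `examplePushes`) are hypothetical and only serve this illustration, and publishing to the MQTT topic is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Field names mirror the JSON seen in the logs; the Go type names are
// hypothetical and only serve this sketch.
type gatewayIDs struct {
	GatewayID string `json:"gateway_id"`
}

type gateway struct {
	GatewayIDs gatewayIDs `json:"gateway_ids"`
}

type classBC struct {
	AbsoluteTime string    `json:"absolute_time"`
	Gateways     []gateway `json:"gateways"`
}

type downlink struct {
	FPort      int     `json:"f_port"`
	FRMPayload string  `json:"frm_payload"`
	Priority   string  `json:"priority"`
	ClassBC    classBC `json:"class_b_c"`
}

type pushRequest struct {
	Downlinks []downlink `json:"downlinks"`
}

// buildPushes returns one push payload per gateway; the i-th payload is
// scheduled at first + i*spacing.
func buildPushes(first time.Time, spacing time.Duration, frmPayload string, gatewayIDList ...string) ([]string, error) {
	out := make([]string, 0, len(gatewayIDList))
	for i, id := range gatewayIDList {
		at := first.Add(time.Duration(i) * spacing).UTC()
		req := pushRequest{Downlinks: []downlink{{
			FPort:      1,
			FRMPayload: frmPayload,
			Priority:   "ABOVE_NORMAL",
			ClassBC: classBC{
				AbsoluteTime: at.Format("2006-01-02T15:04:05.000Z07:00"),
				Gateways:     []gateway{{GatewayIDs: gatewayIDs{GatewayID: id}}},
			},
		}}}
		b, err := json.Marshal(req)
		if err != nil {
			return nil, err
		}
		out = append(out, string(b))
	}
	return out, nil
}

// examplePushes rebuilds the two payloads from the report (1000 ms apart).
func examplePushes() ([]string, error) {
	first := time.Date(2020, time.November, 17, 1, 9, 37, 0, time.UTC)
	return buildPushes(first, time.Second, "ChQKEgoQEg4IAxIFDbQiy10YASDjAQ==",
		"eui-00800000a0003582", "eui-00800000a0005314")
}

func main() {
	pushes, err := examplePushes()
	if err != nil {
		panic(err)
	}
	for _, p := range pushes {
		fmt.Println(p)
	}
}
```

Changing the `spacing` argument (for example to 130 ms) is the only thing needed to try other intervals.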
No downlink failures and messages emitted from the gateway
The Things Stack for LoRaWAN: ttn-lw-stack
Version: 3.9.4
Build date: 2020-09-23T09:56:19Z
Git commit: c4be55c
Go version: go1.15.2
OS/Arch: linux/amd64
Determine the causes for the invalid absolute time and resolve if there is a bug
Happy to test a PR in our dev environment
@rvolosatovs ?
@virtualguy is this issue persisting?
What is the data rate and in which region are you?
When you say you want to transmit with 130 ms apart, are you taking the time-on-air into account?
Can you subscribe to the gateway and end device events, via $ ttn-lw-cli events ..., and paste the exact error messages that Gateway Server reports on why the scheduling fails?
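On the time-on-air question: as a rough aid, the sketch below estimates LoRa time-on-air with the formula from the Semtech SX127x datasheet. It is not taken from The Things Stack. The function name `timeOnAir` is hypothetical, and the sketch assumes an explicit header, coding rate 4/5, an 8-symbol preamble, and low data rate optimization whenever the symbol time exceeds 16 ms. The payload length is the full PHY payload (LoRaWAN headers and MIC included), not only `frm_payload`.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// timeOnAir estimates the LoRa time-on-air for a PHY payload of payloadLen
// bytes. Assumptions (illustrative only): explicit header, coding rate 4/5,
// 8-symbol preamble.
func timeOnAir(payloadLen, sf int, bandwidthHz float64, crc bool) time.Duration {
	tSym := math.Pow(2, float64(sf)) / bandwidthHz // symbol time in seconds

	de := 0.0 // low data rate optimization, enabled for symbols longer than 16 ms
	if tSym > 0.016 {
		de = 1
	}
	crcBits := 0.0
	if crc {
		crcBits = 16
	}

	num := float64(8*payloadLen-4*sf+28) + crcBits
	den := 4 * (float64(sf) - 2*de)
	// 5 = CR + 4 with CR = 1 (coding rate 4/5).
	payloadSymbols := 8 + math.Max(math.Ceil(num/den)*5, 0)

	total := (8 + 4.25 + payloadSymbols) * tSym // preamble + payload, in seconds
	return time.Duration(total * float64(time.Second))
}

func main() {
	for _, sf := range []int{7, 9, 12} {
		fmt.Printf("SF%d/125kHz, 20 bytes: %v\n", sf, timeOnAir(20, sf, 125000, true))
	}
}
```

Under these assumptions a 20-byte payload with CRC takes roughly 56.6 ms at SF7 but about 1.32 s at SF12 (both at 125 kHz), so whether 130 ms or even 1000 ms spacing is feasible depends heavily on the data rate.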
@virtualguy I added some extra debug lines. Pick any of:
v3.10.7
Please grep the output by #3487 and copy it here. It's also the only output that goes to stdout (as logs go to stderr).
If it says in ScheduleAt that there are too few RTTs, you might want to increase TTN_LW_EXP_RTT_TTL (duration, like 6h for 6 hours, default is 30m for 30 minutes) and/or decrease TTN_LW_EXP_SCHEDULE_MIN_RTT_COUNT to 3 or something (default is 5). These are temporary feature flags.
cc @ymgupta
FYI, we are having some trouble running the binaries we have built, so we are blocked on #3736 before we can gather logs from your patch @johanstokking
@virtualguy @johanstokking Here are some logs from running the patched version: https://gist.github.com/kurtmc/75f4ecf93c2f7a1ee8373d3a9c7f181a
Thanks a lot. Apologies for the delayed response.
This is going to be super helpful. For finding the root cause, I added a few more log entries.
Can you run this again, with a new build, using https://github.com/TheThingsNetwork/lorawan-stack/tree/investigate/3487-abs-downlink-timing?
If you want to cherry-pick on v3.10.x, cherry-pick 50f56055ba2c0172c002784d0ef22f140b60903c and d1e5305d2363aa8ce8dc485b36c9977857da6b66
Also for the output, I need the full trace, from the beginning.
Note that we're now printing the gateway ID (or EUI). If that is sensitive information, please redact that to another meaningful value or send the log via email.
@kurtmc @virtualguy let me know how we can help with setting this up. If I need to send you a binary or Docker image, just let me know.
@johanstokking Updated logs: https://gist.github.com/kurtmc/041dd593d24fd9f01784e56ec1deb325
Thanks @kurtmc. I don't see the "no absolute time" errors appearing here, only in the beginning but that is normal. So here, everything worked as expected, right?
@adriansmares the race for which synchronization is fixed with https://github.com/TheThingsNetwork/lorawan-stack/pull/3794 is actually happening here. So that is real, see:
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 1 ScheduleAt: median is 30.900256ms"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 1 ScheduleAt: relative time downlink"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 1 ScheduleAt: scheduled"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 Record: 30.629582ms at 2021-02-14 21:58:36.014795727 +0000 UTC m=+86.734546258"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"
"1613339916024","15/02/2021 10:58:36.024 +1300","stack_1 | #3487 Stats: sorted items: [{30446978 {13835767378842655859 75767631763 0x3a7a1c0}} {30504816 {13835767372765423622 70132850448 0x3a7a1c0}} {30639590 {13835767387106897922 83515681039 0x3a7a1c0}} {30830194 {13835767336362635015 36237283864 0x3a7a1c0}} {30834436 {13835767341445058316 40950998050 0x3a7a1c0}} {30835582 {13835767336764530097 36639178949 0x3a7a1c0}} {30842181 {13835767341958726238 41464666012 0x3a7a1c0}} {30845267 {13835767320071738555 21052514770 0x3a7a1c0}} {30847416 {13835767323811969378 24571520108 0x3a7a1c0}} {30889585 {13835767369324628633 66913280952 0x3a7a1c0}} {30910927 {13835767324288677194 24974486184 0x3a7a1c0}} {30943729 {13835767344594801745 43879516001 0x3a7a1c0}} {30952793 {13835767349113727476 48103474433 0x3a7a1c0}} {30963569 {13835767326445257112 26983582371 0x3a7a1c0}} {30993783 {13835767329486798315 29803898110 0x3a7a1c0}} {31018584 {13835767328128331563 28592914998 0x3a7a1c0}} {31045641 {13835767318515021603 19643281462 0x3a7a1c0}} {31073394 {13835767334606151403 34628283902 0x3a7a1c0}} {31090774 {13835767382891657128 79595407630 0x3a7a1c0}} {31249183 {13835767338921164328 38648329532 0x3a7a1c0}}]"
Note that these statements are out-of-order because of underlying flushing; it looks like short statements are flushed immediately and longer statements are delayed.
I think that the issue is that when (*rtts).Record() releases the write lock, both concurrent (*rtts).Stats() calls acquire a read lock, and so two concurrent (*Scheduler).ScheduleAt() calls become exactly in sync. As those are acquiring (another) read lock, they happen concurrently, leading to corruption in Scheduler state.
This is fixed with https://github.com/TheThingsNetwork/lorawan-stack/pull/3794.
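To illustrate the class of problem described above, here is a minimal Go sketch. It is not the actual Scheduler code: the `scheduler` type, its fields, and `scheduleConcurrently` are hypothetical. The point is only that a method which checks and then mutates shared state needs the exclusive lock of a `sync.RWMutex`; holding just the read lock lets two callers run the check-then-append sequence at the same time.

```go
package main

import (
	"fmt"
	"sync"
)

// scheduler is a hypothetical stand-in for a structure with shared state;
// it is not the lorawan-stack Scheduler.
type scheduler struct {
	mu        sync.RWMutex
	emissions []int64 // start times of accepted emissions, in nanoseconds
}

// scheduleAt mutates shared state, so it takes the exclusive write lock.
// Using mu.RLock() here instead would allow two goroutines to pass the
// overlap check together and both append.
func (s *scheduler) scheduleAt(start, duration int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, e := range s.emissions {
		if start < e+duration && e < start+duration {
			return false // overlaps an already accepted emission
		}
	}
	s.emissions = append(s.emissions, start)
	return true
}

// count only reads, so a shared read lock is sufficient.
func (s *scheduler) count() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.emissions)
}

// scheduleConcurrently fires n identical requests at once and returns how
// many emissions were accepted.
func scheduleConcurrently(n int) int {
	s := &scheduler{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.scheduleAt(1000000, 50000)
		}()
	}
	wg.Wait()
	return s.count()
}

func main() {
	fmt.Println("accepted emissions:", scheduleConcurrently(8))
}
```

With the write lock in place, exactly one of the identical requests is accepted. Running a variant that uses `RLock` in `scheduleAt` under `go run -race` should report the data race.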
@kurtmc is there a chance we can get DEBUG level logs from The Things Stack as well? Set log.level: debug in the config YAML.
@virtualguy @kurtmc can you verify that this is resolved on latest v3.11?
Note the minor version bump: you need to run DB migrations, see https://github.com/TheThingsNetwork/lorawan-stack/blob/e56f7f70e60dba8c1ad584411fb63a8c35659e7c/CHANGELOG.md#3110---2021-02-10
We'll be rolling a 3.11.1 release today.
Closed by #3794
@virtualguy @kurtmc FYI we're backporting this to 3.10.10 so we can update our infrastructure sooner. Please keep an eye on #3800 and/or subscribe to release notifications here.
@johanstokking Just to let you know, we have upgraded our production environment to 3.10.10 and we are still seeing the absolute time error in the logs.
@kurtmc it is expected in the following case:
We use the latency between scheduling the downlink message and receiving the TX acknowledgment as the round-trip time. We need at least 5 of them to reliably take the median value. Then, when scheduling a class C downlink message with absolute time, Gateway Server uses the server time and the median round-trip time to calculate the absolute (server) time and the corresponding concentrator timestamp.
If you see absolute time errors still, outside the cases above, please provide DEBUG level server logs.
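As a rough illustration of the median round-trip time rule described above, here is a small Go sketch. The helper `medianRTT` is hypothetical and not the Gateway Server implementation. For an even number of samples it averages the two middle values, which is consistent with the earlier debug output (20 samples, middle values 30889585 ns and 30910927 ns, reported median 30.900256ms).

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// medianRTT returns the median of the given round-trip times, or false when
// fewer than minCount samples are available. Hypothetical helper for
// illustration only.
func medianRTT(minCount int, samples ...time.Duration) (time.Duration, bool) {
	if len(samples) == 0 || len(samples) < minCount {
		return 0, false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	mid := len(sorted) / 2
	if len(sorted)%2 == 1 {
		return sorted[mid], true
	}
	return (sorted[mid-1] + sorted[mid]) / 2, true
}

func main() {
	// Too few samples: no median, so an absolute time cannot be resolved yet.
	_, ok := medianRTT(5, 30*time.Millisecond, 31*time.Millisecond)
	fmt.Println("enough samples:", ok)

	// Six samples taken from the sorted debug output earlier in this thread.
	rtt, ok := medianRTT(5, 30845267, 30847416, 30889585, 30910927, 30943729, 30952793)
	fmt.Println(rtt, ok)
}
```

This also shows why absolute time errors right after a gateway connects are expected: until enough TX acknowledgments have been recorded, there is no median to work with.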
We are definitely still seeing this issue in 3.10.10. It looks like it happens on back-to-back transmissions (1000ms apart on the same device-id). I have sent DEBUG logs through via TTI support.