Deconz-rest-plugin: IKEA lights occasionally lose connection

Created on 14 Feb 2019  ·  493 Comments  ·  Source: dresden-elektronik/deconz-rest-plugin

Occasionally a light (mostly a Tradfri GU10) becomes unavailable in the Phoscon app and cannot be switched off/on via Phoscon (or HASS). I'm using deCONZ 2.05.55 and firmware 262F0500 on the Conbee now, but I had the same problem with older versions of deCONZ and the Conbee firmware.


_(Not always this light)_

  1. Any clue why?
  2. Is it possible to restore the connection other than by disconnecting and reconnecting power?
Backlog Confirmed Bug To-Do

Most helpful comment

Some more analysis today...
In my previous posts, you have seen that my Garage light is routed through my Zolder light. Both IKEA bulbs. The radio link from Zolder to Garage is right on the edge of what it can reach, so fails often.

Today, although the Garage light responds to group commands, it does not respond to unicast commands. Actually, sometimes it does and sometimes not. This is a behavior that should be familiar to those who've read/contributed to this thread.

I can find this in the sniffing logs. Sometimes the Zolder light is able to communicate with Garage and sometimes not. Any time Zolder light cannot communicate with Garage, it reports this:

ZigBee Network Layer Command, Dst: DeConz, Src: Zolder No
    Frame Control Field: 0x1a09, Frame Type: Command, Discover Route: Suppress, Security, Destination, Extended Source Command
    Destination: 0x0000[DeConz]
    Source: 0xd6b7[Zolder Noo]
    Radius: 30
    Sequence Number: 65
    Destination: dresden-_ff:ff:00:c4:9a (00:21:2e:ff:ff:00:c4:9a)
    Extended Source: SiliconL_ff:fe:c3:4a:3e (00:0b:57:ff:fe:c3:4a:3e)
    ZigBee Security Header
    Command Frame: Network Status
        Command Identifier: Network Status (0x03)
        Status Code: Non-tree Link Failure (0x02)
        Destination: 0x9e0c[Garage]

This packet should tell DeConz to start finding another route to reach Garage, but that does not happen. The next packet sent to Garage is again routed through Zolder. To me that is a bug that must be solved.
This next packet for Garage is received by the Zolder light, but that light does not even try to send it to the Garage. Maybe this is a behavior of the IKEA firmware which is not good, but the root cause of the issue is the refusal of DeConz to find an alternative route.
I think that if a route is not available for a prolonged period of time, maybe the Garage light is starved for ACKs at a higher level than the 802.15.4 protocol and that may cause the firmware to disconnect or even crash. And I agree it should not, but the root cause issue is DeConz refusing to find a new route to the Garage light.
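To make the expectation concrete, here is a minimal sketch of the behavior I would expect on receiving that Network Status frame. It is not deCONZ or ConBee firmware code; `RouteTable` and its fields are hypothetical names, and only the status code 0x02 and the short addresses come from the capture above.

```python
from dataclasses import dataclass, field

NON_TREE_LINK_FAILURE = 0x02  # status code from the sniffed Network Status frame


@dataclass
class RouteTable:
    """Toy next-hop table: destination short address -> next-hop short address."""
    next_hop: dict = field(default_factory=dict)
    pending_discovery: set = field(default_factory=set)

    def on_network_status(self, reporter: int, status: int, destination: int) -> bool:
        """Invalidate the route via `reporter` and flag `destination` for rediscovery.

        Returns True when a route was invalidated, False otherwise.
        """
        if status != NON_TREE_LINK_FAILURE:
            return False
        if self.next_hop.get(destination) != reporter:
            return False  # the failing router is not our next hop for this destination
        del self.next_hop[destination]
        self.pending_discovery.add(destination)
        return True


# Garage (0x9e0c) is currently routed through Zolder (0xd6b7).
table = RouteTable(next_hop={0x9E0C: 0xD6B7})
table.on_network_status(reporter=0xD6B7, status=0x02, destination=0x9E0C)
```

After the call, Garage no longer has a next hop and sits in `pending_discovery`, so the next packet would trigger a route request instead of being sent through Zolder again.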

Today I did an experiment to get DeConz to find another route to the Garage light: I disconnected mains from the Zolder light and looked at the sniffing logs. After a few tries, DeConz realized that Zolder was gone and went ahead to find an alternative route to Garage. Next I reconnected Zolder and, after it announced its presence, a new route was found for Zolder as well. DeConz has not (yet) reverted to routing Garage through Zolder.

Funny thing is that in the new situation, DeConz now talks directly with the Garage light, no routers in between.
Zolder is now reached through a route via 2 other routers (although it was obviously reachable directly by DeConz), so it looks to me like some table (neighbor table?) is full inside DeConz routing firmware.

Maybe this is related to its refusal to create a new route in response to a failing route..?

@manup , I would appreciate any comment from you on the above posts. Or at least let me know how to contribute to a solution (aside from looking for the root cause).

I would like to help create a solution for these issues, since they bother me. If you'd give access to the firmware source code I can contribute directly (even if it is not open source). I don't mind helping Dresden Elektronik in that way :)

All 493 comments

Same here with 2.05.58. One Tradfri GU10 seems to be unresponsive at the moment:
image
This happened to me with a Hue light strip as well a few days earlier, so I don't think it is an IKEA-specific issue. Devices have to be power-cycled and are back to normal after that. It is still annoying in some cases, mostly for my FLOALT panels, which are directly powered and do not have a wall switch to power-cycle them.

  1. Any clue why?

The REST API plugin marks a light as unreachable, when it doesn't receive a response for a couple of times when polling the light for its state. The cause of not receiving a response is, in order of likelihood:
a. The light's power has been cut (e.g. by a 20th century wall switch);
b. The Zigbee network has a hiccup (e.g. due to radio interference or routing issues in the mesh). In this case, the light still reacts to group commands;
c. The light's firmware has crashed.

  1. Is it possible to restore the connection other than by disconnecting and reconnecting power?

In a) and c): no. In b): yes.

The REST API plugin marks the light as reachable, when it receives a message from it. Powering up the light causes it to send a _Device Announcement_ message. In b), typically, the light comes back spontaneously, when the next poll succeeds. You can also select the node in the deCONZ GUI and press 0.
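The marking logic described above can be sketched roughly as follows. This is an illustration only, not the plugin's actual code: the class name and the threshold of 3 missed polls are assumptions (the description only says "a couple of times").

```python
class ReachabilityTracker:
    """Toy model: unreachable after N missed polls, reachable on any received message."""

    def __init__(self, max_missed_polls: int = 3):  # threshold is an assumption
        self.max_missed_polls = max_missed_polls
        self._missed = {}  # light id -> consecutive polls without a response

    def poll_failed(self, light_id: str) -> None:
        """Record a poll that got no response."""
        self._missed[light_id] = self._missed.get(light_id, 0) + 1

    def message_received(self, light_id: str) -> None:
        """Any message (poll response, Device Announcement, report) resets the count."""
        self._missed[light_id] = 0

    def is_reachable(self, light_id: str) -> bool:
        return self._missed.get(light_id, 0) < self.max_missed_polls


tracker = ReachabilityTracker()
for _ in range(3):
    tracker.poll_failed("gu10-kitchen")
was_reachable = tracker.is_reachable("gu10-kitchen")    # False after three misses
tracker.message_received("gu10-kitchen")                # e.g. Device Announcement after power-up
is_reachable_now = tracker.is_reachable("gu10-kitchen") # True again
```

In case b) the reset happens on its own as soon as one poll succeeds; in cases a) and c) nothing arrives until the light is powered up again.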

Same problem also in version 2.05.59.

I also have this problem, even after upgrading to 2.05.59. Today one of my three outdoor lights was "gone".
All three of them are Tradfri bulbs.
image

@ebaauw Thanks for your explanation.

a. The light's power has been cut (e.g. by a 20th century wall switch);

No wall switches available for these lights, so I am not able to accidentally disconnect power.

b. The Zigbee network has a hiccup (e.g. due to radio interference or routing issues in the mesh). In this case, the light still reacts to group commands;

The lights don't react to group commands when the connection is lost (the lights are assigned to a Hue Dimmer in the Phoscon app and don't respond to the Hue Dimmer when the connection is lost).

c. The light's firmware has crashed.

Firmware 1.2.214 is installed on all my IKEA GU10 spots. I have 20+ of them and a random light goes offline roughly once every 2-3 weeks.

I had the same experience twice in the last few months with two different E14 IKEA bulbs (IKEA fw 1.2.214).
Power cycling worked both of the times for me.

c. The light's firmware has crashed.

When the lights don't react even to group commands, it looks like a firmware crash.

2.05.59 has adopted the IKEA gateway parameters to configure light state reporting, mainly in the hope of not triggering any bugs by using a configuration which IKEA itself doesn't test. Side note: with this change, timers are no longer used for reporting on the device.

The new configuration will be applied once a light is power-cycled.

We still send some maintenance requests like group membership and neighbor table queries to the lights, and might restrict these further if stability doesn't improve with 2.05.59.

Keep in mind there might also be the possibility that a bug is in the light firmware which is not related to any requests the gateway sends.

+1 on "not necessarily related to deconz but rather the Zigbee network or manufacturer FW" in correlation with Zigbee standard interpretation in manufacturers FW.

I just updated to 2.05.59 and after restarting deconz the same light is not reachable again. Pressing 0 in gui doesn't bring it back. Any other light works. In my case this might as well be an issue with the light itself.

@peer69 @thomas70 Did you power-cycle the light? This is needed, as mentioned by @manup in https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-463948127

The new configuration will be applied once a light is power-cycled.

Good hint, I haven't done that. For now I had to go back to .58 for another reason (high CPU load making the gateway unresponsive). I will try this again later today with .59 and power-cycle all IKEA lights.

I'm also having this issue, it happened twice for the same GU 10 light. Currently running 2.05.59, and I did power cycle the lights after the update.

Forgot to add that it sometimes seems like it's the same bulb that keeps failing. A while ago I did have issues with another bulb, and it would always be that one to stop reacting.

After power cycle my IKEA bulbs came back. But my IKEA FLOALT panel WS is still offline

I'm experiencing this same issue and have been doing so for quite a while, and I'd say that .59 actually made things worse for me. I have 80 nodes of which 32 are Trådfri lights and switches, 5 are Hue lights and the rest are different Xiaomi battery powered devices like temperature, motion, smoke detector etc. Every single type of device has been unresponsive at least once so in my case it's not just the Trådfri lights, but at the time I'm just having issues with the Trådfri and Hue lights.

The thing is that I ran all the lights through a Hue bridge and the Xiaomi sensors through the Xiaomi gateway earlier and then they were all rock solid, so I don't think it's the device firmware that's the culprit in my case unless it's caused by the change in circumstances.

I have six Trådfri GU10 lights in one location that worked perfectly before, but after the upgrade to .59 and several power cycles later they are now almost completely unresponsive and I will probably have to reset them. What's strange is that this unresponsiveness also seems to be "moving" from different lights depending on which lights that have power. If I cut the power to some of the unresponsive lights it may take a while and then it's suddenly some other light that doesn't want to work properly. Perhaps there's some offset somewhere that's breaking things?

The thing is that I ran all the lights through a Hue bridge and the Xiaomi sensors through the Xiaomi gateway earlier and then they were all rock solid, so I don't think it's the device firmware that's the culprit in my case unless it's caused by the change in circumstances.

Interesting, did you also have all 32 Ikea lights on the Hue network? I'm asking because Hue bridge uses polling only and doesn't configure attribute reporting.

Did you also have router devices like the Hue or Ikea lights on the Xiaomi network?

I have six Trådfri GU10 lights in one location that worked perfectly before, but after the upgrade to .59 and several power cycles later they are now almost completely unresponsive and I will probably have to reset them. What's strange is that this unresponsiveness also seems to be "moving" from different lights depending on which lights that have power. If I cut the power to some of the unresponsive lights it may take a while and then it's suddenly some other light that doesn't want to work properly. Perhaps there's some offset somewhere that's breaking things?

Hmm, that's pretty bad. I really wonder how this happens; 2.05.59 is way more "familiar" to Ikea lights than prior versions. The configuration now happens the way the Ikea gateway does it.

When a light becomes unresponsive, can you please select the node in deCONZ and press 0? If it becomes responsive/yellow again, the light doesn't need to be power-cycled. Note that the light becoming a Zombie in this case will be fixed soon; this may currently happen in a certain network constellation.

By the way the usual questions:

  • Which firmware version are you using?
  • RaspBee or ConBee?
  • If ConBee do you use a usb extension cable?

It took a while longer than expected but now everything actually appears to be working fine again. At least for now from what I can tell.

I rebooted the server and also power cycled every single mains powered light in the network to make sure that they fetched the latest configuration, but despite this it took a couple of hours before the issue went away so I was a bit premature in my assumption that the issue remained as it did not work right away.

Interesting, did you also have all 32 Ikea lights on the Hue network? I'm asking because Hue bridge uses polling only and doesn't configure attribute reporting.

Yes, sort of. I had 31 Ikea lights on the Hue network as well as the Hue lights. The 32nd Ikea device is the switch outlet which I hadn't bought back then.

Did you also have router devices like the Hue or Ikea lights on the Xiaomi network?

No, just battery powered sensors

When a light becomes unresponsive, can you please select the node in deCONZ and press 0? If it becomes responsive/yellow again, the light doesn't need to be power-cycled. Note that the light becoming a Zombie in this case will be fixed soon; this may currently happen in a certain network constellation.

I did try this multiple times earlier with no effect. And as for the hardware and setup, I'm using a ConBee with USB extension cable and 262F0500. Since everything seems to be working fine for me now this info may not be of any use at the moment but I'll try not to jump to any conclusions and let the network run for a few days to make sure that the issue doesn't return.

I have been running .59 since last week-end and I still lose random Ikea lights. (16 E27 bulbs on house facade.)
Bulb FW is the same as others still on the Ikea Gateway.
Using ConBee with 262F0500 FW.
Last week-end I also bought a HUE bridge and was just about to start moving the lights over when I noticed the 'Under the hood' release note for .59. Decided to hold off, but will re-consider this upcoming week-end.
Deconz will still be my best Xiaomi/mi Cube controller of choice. Haven't missed a gesture yet.

I have kind of the same situation here: 16 IKEA lights, 2 IKEA control outlets, a Heiman plug, an innr plug and some Xiaomi sensors (cube/door sensors/motion sensor). I never had problems with the non-IKEA devices. However, I currently have almost daily issues where an IKEA light drops out of the network.

I use a Conbee with a USB extension cable on firmware 0x26300500 and deCONZ .59

My lights have been working fine for a while now but a couple of days ago my Trådfri E14 bulb suddenly became unresponsive. One power cycle later it came back to life.

Today it was time for one of the GU10's to drop out. It's physically very close to the previously mentioned E14 so I'm not sure if it's a coincidence or not. The GU10 may very well have been routed via the E14 even though all my lights are within ConBee range.

Selecting the nodes and pressing 0 in deCONZ does not do anything. I have also tried rebooting the deCONZ container, and while it reroutes the network on startup, it does not establish any route to that specific bulb. What would be the best approach to proceed with the troubleshooting?

12 days later, another GU10 became unreachable and will not connect again without a power cycle.

Happy to share whatever info is needed to get into this issue.

Same here. Yesterday, after 6 days of flawless connection, I lost one of my Tradfri bulbs. Power off/on and a reset didn't help. It's still yellow in deCONZ but I can't connect to it or control it.

image

Same here. After some days without any issue, two of my GU10 Tradfri lamps stopped responding today. I was able to bring one of them back to life by pressing 0 in the GUI, but I had to power-cycle the other one.
Fortunately this only seems to happen for GU10 devices atm, my FLOALT Panels had no issues yet (in my setup they can only be powercycled by using the circuit breaker).

The issue has continued for me as well. I have now experienced 3-4 more GU10 bulbs losing connection as well as one of my Hue E27's and a Xiaomi door sensor (magnet). Some lights have started working again after a power cycle, others have not. Pressing 0 does nothing.

It's also noteworthy that the Xiaomi sensor started working again after I power cycled an adjacent and unresponsive GU10 so I suppose that the sensor was routing through that light, but shouldn't it automatically reroute if there are any connection issues?

Same issue here. Yesterday I updated to the latest version, .59, and now a couple of Ikea lights are unresponsive.

Hi, can you give more insight into the total network, like the network size and the other mains powered devices in it?

I've rearranged my home network a few days ago, now including:

  • 5x IKEA WS GU10 (firmware 1.2.221, product code LED1537R6GU10EU)
    With mac address 0x000b57ff..... (older batch)
  • 2x IKEA dimmable E27 (firmware 1.2.214)
  • 1x IKEA E14 WS light (firmware 1.2.221)
  • 1x IKEA repeater (firmware 2.0.019)
  • 1x IKEA outlet (firmware 2.0.019)
  • 1x OSRAM GU 10
  • 1x OSRAM E27 color bulb
  • 1x OSRAM plug
  • 1x Hue E27 color bulb
  • 1x Hue E27 dimmable lux bulb

deCONZ 2.05.59; ConBee firmware 0x26300500 (but 0x262f0500 is fine too).

I have 4x FLS-PP lp but these are powered off now for testing, since they act as very strong signal repeaters.

With sensors and switches the total network size is 55.
All lights are always powered and till now show zero outages.

Here are some more detailed specifications of my network if it can be of any help. I’m still running 2.05.59 with 262F0500 and an extension cord to the ConBee. As mentioned above, after first updating to 2.05.59 and power cycling every mains powered device and waiting for a couple of hours the network was flawless for almost a week, so it seems to take a while until the issues start to appear. Unfortunately the issue reappeared and a full power cycle of all mains powered devices as well as a deCONZ reboot does not resolve the issue anymore. It also seems that the issue is wandering from device to device because sometimes a light may be unresponsive for a while and then it fixes itself.

Earlier today I had an issue where the Trådfri E14 was unresponsive as well as one Hue E27. After a power cycle of the E27 the E14 came back to life as well without me even touching it. The same goes for the unresponsive GU10's that seem to be trading places now and then, so there are at least two unresponsive GU10's every day but it's not always the same lights so some start working while others break and vice versa.

My network currently consists of the following 80 devices including ConBee and the mains powered devices are powered 24/7.

Mains powered

| Quantity | Type | Firmware |
|----------|------|-------------|
|30 | Trådfri GU10 dimmable | 1.2.214 |
| 4 | Trådfri GU10 white spectrum | 1.2.217 |
| 1 | Trådfri E14 opal dimmable | 1.2.217 |
| 1 | Trådfri control outlet | 1.4.020 |
| 3 | Hue E27 White and Color A19 | 1.29.0_r21169 |
| 2 | Hue E14 White ambiance LTW012 | 1.29.0_r21169 |

Battery powered

| Quantity | Type | Firmware |
|----------|------|-------------|
| 1 | Trådfri on/off switch | 1.4.018 |
| 10 | Xiaomi Aqara multisensor (square temp/hum/pres) | 20161129 |
| 3 | Xiaomi Aqara motion sensor (motion/lux) | 20170627 |
| 4 | Xiaomi Aqara water sensor | 20170721 |
| 1 | Hue motion sensor | 6.1.0.18912 |
| 11 | Xiaomi Aqara contact sensor | 20161128 |
| 8 | Xiaomi/Honeywell smoke sensor | N/A |

Last week deconz seemed to run mostly fine, but yesterday I had another IKEA bulb (white spectrum) losing connection to deconz. Even turning it off and on again didn't help. Had to restart deconz for it to work again somehow.

I've got a network with mostly IKEA bulbs, a Heiman outlet and quite some Xiaomi sensors.

Some specifications of my zigbee network:

Conbee firmware 262F0500 with extension cable on a NUC.
deCONZ 2.05.55 in Docker, so the first thing I have to do is upgrade to 2.05.59 I guess.

Powered (24/7)

| Quantity | Type | Firmware |
| ------------- | ------------- | ------------- |
|4x | Tradfri E27 white| 1.1.1.0-5.7.2.0|
|2x| Tradfri E27 white| 1.2.214|
|21x| Tradfri GU10 dimmable| 1.2.214|
|3x| Osram Smart+ socket| 1.04.12|

Battery powered

| Quantity | Type | Firmware |
| ------------- | ------------- | ------------- |
|3x| Hue Dimmer Switch| 5.45.1.17846|
|1x| Aqara smart switch| 20180525|
|1x| Aqara smart switch| 20161128|
|1x| Aqara double wireless switch| 20170411|
|1x| TRADFRI remote| 1.2.214|
|6x| Aqara multisensor| 20161129|
|10x| Aqara contactsensor| 20161128|
|5x| Aqara motion sensor| 20170627|
|1x| Aqara leak sensor| 20170721|
|1x| Aqara vibration sensor| 20180130|

Any updates on this for the current version? I have been running .60 for 3 days and no light has lost connection yet.

Unfortunately I already had a lost connection with a regular Tradfri E27 white bulb on .60 and the newest firmware.

That's bad news... if I understand correctly the polling intervals have been changed in .60 to be less aggressive. Aggressive polling causing a light to hang up made perfect sense to me, too bad this doesn't seem to be the solution to our problem.

Yesterday I turned off the power for all mains powered devices and updated to 2.05.60 and 26320500 before turning them on again, just to play it safe. The lights then all worked fine for about 24 hrs but just a few minutes ago I noticed that one of my GU10's had stopped responding. Luckily enough it came back to life again some minutes later without any manual interaction from my end so perhaps the network was just clogged.

@JBS5 I would recommend to update the 4x Tradfri E27 white at 1.1.1.0-5.7.2.0 to a recent firmware version. If I recall correctly this is still the very first version.

@jurriaan which version has your Tradfri E27 white bulb?

That's bad news... if I understand correctly the polling intervals have been changed in .60 to be less aggressive.

Yep, basically very similar to the IKEA gateway. Now the only remaining difference is the periodic query of neighbor tables (which is used to display the mesh network lines).

This can be turned off by clicking on the CRE icon in deCONZ and unchecking "Routers and Coordinator".
Might be worth a test.

On Reddit there was a post mentioning that the IKEA gateway should in theory support up to 100 devices, but that this isn't tested very well. It would be interesting to know the usual network size the IKEA team does test.

https://www.reddit.com/r/tradfri/comments/96yiq4/google_home_losing_lamps_and_rooms/e4x1scz/?context=1

You could probably have 100 devices connected to your Gateway. This is not tested properly by us, which is why we don't guarantee it. But the technical limit is 100, and I've seen people who have 100 devices with none or only minor issues.

New versions for the Gateway will support the same amount of devices (Officially 50). You could add another Gateway to your system if you want to double that.

@JBS5 I would recommend to update the 4x Tradfri E27 white at 1.1.1.0-5.7.2.0 to a recent firmware version. If I recall correctly this is still the very first version.

Those E27 bulbs with the old firmware didn't fail during the past 6 months while others did...

Very interesting, how about the E27 at version 1.2.214?

Very interesting, how about the E27 at version 1.2.214?

They lost connection only once in the past months.

It's been a few weeks since the last GU10 lost connection, and I am still using deCONZ 2.05.55 and firmware 262F0500 on the Conbee.

I also have this problem. Only Ikea nodes, (43, mainly lights). I have no knowledge about zigbee, but since I have not seen it mentioned: my network seems more stable with OTAU disabled. The other day I also changed network preferences to less secure. Cannot remember which one, but after that I have not lost any lights.

After some days without any issue several GU10 have become unresponsive.
Another issue, which might be unrelated: an Osram light lost connection as well. Even though it was still shown in the GUI and seemed to mesh, I couldn't control it any more. I had to delete the light and re-add it; it was assigned another light number but regained its former name shortly after adding it. No idea what's happening here, but this is quite a bit more maintenance than I would like to see for my setup.

@peer69 did you also try just to power-cycle the light? Normally a factory reset shouldn't be needed.
You're on 2.05.60? Can you also provide some more details about your network, how many lights and mains powered devices?

@manup I powercycled the light. Several times. It was controllable for about 10 seconds after a powercycle but then turned unresponsive again (red lights flashing in GUI).
In the meantime this issue came back even after the factory reset and also affected another OSRAM light. For now I have removed the only two OSRAM lights from my network and replaced them with Hue bulbs. I can offer some testing with the OSRAM lights but I would need some time for that.

I am running 2.05.60. Currently there are 57 nodes connected to the network of which 27 devices are mains powered. I use 13 IKEA GU10, 1 IKEA FLOALT Panel, 2 IKEA E14, 3 OSRAM Smart+ Plugs, 3 E14 hue bulbs, 5 E27 hue bulbs, 1 hue lightstrip.

I also had to power-cycle some IKEA GU10s in the past days which had become unresponsive. After the power cycle everything is back to normal and I don't see a pattern. I have lamps with several GU10s, and no more than one GU10 became unresponsive at the same time, even though they are always controlled as a group.

At the moment I'm back to square one. Now I regularly have 2-3 lights that don't respond, but it jumps between different lights so sometimes they work and sometimes they don't depending on which other lights that are online. The issue seems to be cascading because when I physically turn off some lights by cutting the power to them, it shifts the issue to other lights.

I've also lost a couple of Xiaomi temperature sensors as well as door sensors and an IKEA on/off switch but they don't pop back in so they probably need to be re-paired in order to start working again.

Things worked fine immediately after a total power cycle as I noted earlier but a few days later the lights started getting unresponsive again and it's been like this for a couple of weeks now. Back when it happened the first time a guest accidentally unplugged the E14 for several hours so I'm not sure if it's completely unrelated or if unexpected disturbances of the zigbee mesh caused the routing to go crazy. Given that I'm apparently not the only one with these issues I think that it may just have been a coincidence.

I really like the concept of having all my zigbee devices in one single mesh but I'm almost at the point where I boot up the old Hue and Xiaomi gateways and put the ConBee in a drawer, which I really don't want to do for several other reasons. Does anyone have any tips for further troubleshooting that could help me identify exactly what's going on and how to resolve it?
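One low-effort way to see which lights drop out, and when, is to log the `reachable` flag that the REST API reports. The sketch below only does the filtering step on an already parsed response; it assumes the usual shape of the `/lights` response (an object keyed by light id, each entry with a `name` and a `state.reachable` field). Fetching the data and the function name `unreachable_lights` are left as assumptions.

```python
def unreachable_lights(lights: dict) -> list:
    """Return sorted (id, name) pairs for lights whose state.reachable is false."""
    result = []
    for light_id, light in lights.items():
        state = light.get("state", {})
        if state.get("reachable") is False:
            result.append((light_id, light.get("name", "")))
    return sorted(result)


# Hand-written sample in the assumed response shape, not real data.
sample = {
    "1": {"name": "Garage", "state": {"on": False, "reachable": False}},
    "2": {"name": "Zolder", "state": {"on": True, "reachable": True}},
}
for light_id, name in unreachable_lights(sample):
    print(f"light {light_id} ({name}) is unreachable")
```

Running this every few minutes and writing the result with a timestamp would show whether the problem really wanders from light to light and whether it follows a particular router.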

One of my IKEA GU10 spots is now unresponsive too.

In the sniffer I see it's still somewhat "alive" and sends NWK Link Status messages, but it obviously thinks it is alone in the network (Link Status Count: 0).

image

It doesn't respond to unicast commands but still sends the periodic ZCL attribute report of the modelid.

I have been sniffing for ~2 hours. I'm not sure when it became unresponsive or whether it is related, but the ZCL sequence number of the report is low:

image

I'll do some more tests and won't powercycle it for a while.

My Tradfri power socket acted really weird today too and got stuck in a reboot loop; I never had this one before.

Just noted a second IKEA GU10 is also a walking led, same symptoms as the other one.

Both devices respond neither to unicast, nor to groupcast, nor to a ZDP NWK address request (pressing 0).
They send empty NWK Link Status commands.

The second GU10 also did send ZCL OTA Query Next Image requests.

image

It seems that the response doesn't come through.

Only a wild guess, but I figure the light's in-buffers are blocked and that's why nothing is received and processed. The out-buffers are still working, hence the firmware is able to send reports and OTA queries.

It would be good if they implement something like a simple health check so that the firmware can reboot after a while, if mac layer is working (commands are received) but nwk and aps layers stay silent.
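A rough sketch of what such a health check could look like, written in Python for readability rather than as firmware code. Everything here is an assumption for illustration: the class name, the 10 minute timeout, and the idea that the firmware can timestamp frames per layer.

```python
class StackHealthCheck:
    """Toy watchdog: ask for a reboot when MAC frames arrive but NWK/APS stay silent."""

    def __init__(self, timeout_s: float = 600.0, start: float = 0.0):
        self.timeout_s = timeout_s     # assumed value, not from any real firmware
        self.last_mac_rx = None        # time of the last frame seen by the MAC layer
        self.last_upper_rx = start     # time of the last frame processed by NWK/APS

    def on_mac_frame(self, now: float) -> None:
        self.last_mac_rx = now

    def on_nwk_or_aps_frame(self, now: float) -> None:
        # A frame that reaches NWK/APS necessarily passed the MAC layer too.
        self.last_mac_rx = now
        self.last_upper_rx = now

    def should_reboot(self, now: float) -> bool:
        if self.last_mac_rx is None:
            return False  # radio is silent: nothing to conclude
        mac_alive = (now - self.last_mac_rx) <= self.timeout_s
        upper_silent = (now - self.last_upper_rx) > self.timeout_s
        return mac_alive and upper_silent


check = StackHealthCheck(timeout_s=600.0, start=0.0)
check.on_mac_frame(now=700.0)               # frames still arrive on the radio
stuck = check.should_reboot(now=710.0)      # True: upper layers silent for > 600 s
check.on_nwk_or_aps_frame(now=715.0)
recovered = check.should_reboot(now=720.0)  # False: the stack processes frames again
```

The check deliberately does nothing when the MAC layer is silent as well, since a light that is simply out of range should not reboot in a loop.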

I also have this problem, even after upgrading to 2.05.59. Today one of my three outdoor lights was "gone".
All three of them are Tradfri bulbs.
image

off topic. How do you find your light in all this ?lol. I have similar setup and I was like... nooo time for this

by similar i mean +- 50 devices

+- 50 is kind of ok. One of our test networks has +180 devices with deCONZ on a Raspberry Pi 1, that's fun :)

We have some plans to add better filter/sorting to deCONZ to simplify finding devices, currently it's really cumbersome at a certain device count.

The lights are still stuck.

One interesting observation: I powered off the parent of a Philips Hue motion sensor (a Hue Lux) so that it needs to search a new parent. The sensor now tries to rejoin through one of the stuck IKEA GU10 lights.

The light does respond with a Leave (with Rejoin) command. So it did process the rejoin request!

image

Sadly the Hue motion sensor is stubborn and tries to rejoin through the stuck GU10 light forever, instead of looking for a better parent.

However the interesting part here is that the stuck IKEA light does respond to the rejoin request, perhaps it also processes NWK Leave requests, that could be a base of a workaround to get the light into a working state again.

Correction, Hue motion sensor isn't too stubborn; after a few minutes the sensor selected another working parent (good).

However the interesting part here is that the stuck IKEA light does respond to the rejoin request, perhaps it also processes NWK Leave requests, that could be a base of a workaround to get the light into a working state again.

I did test this, but sadly it doesn't work. While the lights are in stuck state they won't process the NWK Leave with rejoin request.

Currently I can only conclude that IKEA needs to fix either the root issue of the light going stuck or implement some kind of watchdog + health check for NWK/APS layers of the firmware.

It is still unknown what exactly causes the light going into this state – broadcast storm, certain commands, routing issues...?

I'll forward my findings and sniffer logs to IKEA; hopefully they can use them to track down the issue.

So @manup, what is best practice to rejoin a stuck IKEA light?

For me a power cycle of the light clearly works best

@manup I guess that if you don't know how to make the patient better, you should try to make it worse. So, bombard a light with potential causes for lockups and see what happens.
Also, aren't these lights based on the Ember stack? Maybe there are support teams or forums that might shed a light (pun intended).

W.r.t. IKEA, are they responsive to requests for support like these? Or should we team up to gain some attention?

For me a power cycle of the light clearly works best

Not so much for me...
A restart of the ConBee is the only thing that fixes it temporarily.
And then the same, or more often another, IKEA light gets stuck after a while...
Always just one light! It's so odd... And irritating... :/

@manup Awesome! Thanks for looking into this!

I've also forwarded this thread to the IKEA Tradfri team via Reddit. Just got a response from them saying that they've forwarded the information. Let's hope that they can use this to improve their firmware or find a solution for this issue :)

Seems like there’s a new firmware for the tradfri gateway coming in the next days which especially targets HomeKit support. There might be a new firmware for the bulbs as well.

@peer69 This is the latest firmware release:


version_info.json

[
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/190579-ncp572b444.ebl.ota.ota.signed",
    "fw_build_version": 444,
    "fw_file_version_LSB": 444,
    "fw_file_version_MSB": 5720,
    "fw_filesize": 166270,
    "fw_hotfix_version": 2,
    "fw_image_type": 2,
    "fw_major_version": 5,
    "fw_manufacturer_id": 4476,
    "fw_minor_version": 7,
    "fw_type": 1
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/10005777-4.1-TRADFRI-control-outlet-2.0.022.ota.ota.signed",
    "fw_file_version_LSB": 9763,
    "fw_file_version_MSB": 8194,
    "fw_filesize": 204222,
    "fw_image_type": 4353,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/10005778-4.1-TRADFRI-onoff-shortcut-control-2.0.020.ota.ota.signed",
    "fw_file_version_LSB": 1571,
    "fw_file_version_MSB": 8194,
    "fw_filesize": 182078,
    "fw_image_type": 4549,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/10035514-TRADFRI-bulb-ws-1.2.221.ota.ota.signed",
    "fw_file_version_LSB": 5490,
    "fw_file_version_MSB": 4642,
    "fw_filesize": 172734,
    "fw_image_type": 8705,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/10035534-TRADFRI-bulb-ws-gu10-1.2.221.ota.ota.signed",
    "fw_file_version_LSB": 5490,
    "fw_file_version_MSB": 4642,
    "fw_filesize": 172734,
    "fw_image_type": 8707,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/10038562-TRADFRI-sy5882-bulb-ws-2.0.022.ota.ota.signed",
    "fw_file_version_LSB": 9763,
    "fw_file_version_MSB": 8194,
    "fw_filesize": 209790,
    "fw_image_type": 16900,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/1004764-TRADFRI-bulb-cws-1.3.009.ota.ota.signed",
    "fw_file_version_LSB": 38258,
    "fw_file_version_MSB": 4864,
    "fw_filesize": 178366,
    "fw_image_type": 10241,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159495-TRADFRI-transformer-1.2.245.ota.ota.signed",
    "fw_file_version_LSB": 21874,
    "fw_file_version_MSB": 4644,
    "fw_filesize": 181118,
    "fw_image_type": 16641,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159695-TRADFRI-bulb-ws-1000lm-1.2.217.ota.ota.signed",
    "fw_file_version_LSB": 30066,
    "fw_file_version_MSB": 4641,
    "fw_filesize": 173246,
    "fw_image_type": 8706,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159696-TRADFRI-bulb-w-1000lm-1.2.214.ota.ota.signed",
    "fw_file_version_LSB": 17778,
    "fw_file_version_MSB": 4641,
    "fw_filesize": 168318,
    "fw_image_type": 8449,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159697-TRADFRI-driver-hp-1.2.217.ota.ota.signed",
    "fw_file_version_LSB": 30066,
    "fw_file_version_MSB": 4641,
    "fw_filesize": 173246,
    "fw_image_type": 16898,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159698-TRADFRI-driver-lp-1.2.217.ota.ota.signed",
    "fw_file_version_LSB": 30066,
    "fw_file_version_MSB": 4641,
    "fw_filesize": 173246,
    "fw_image_type": 16897,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159699-2.1-TRADFRI-remote-control-1.2.223.ota.ota.signed",
    "fw_file_version_LSB": 13683,
    "fw_file_version_MSB": 4642,
    "fw_filesize": 159806,
    "fw_image_type": 4545,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159700-TRADFRI-motion-sensor-1.2.214.ota.ota.signed",
    "fw_file_version_LSB": 17778,
    "fw_file_version_MSB": 4641,
    "fw_filesize": 157822,
    "fw_image_type": 4548,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/159701-TRADFRI-wireless-dimmer-1.2.248.ota.ota.signed",
    "fw_file_version_LSB": 34162,
    "fw_file_version_MSB": 4644,
    "fw_filesize": 172926,
    "fw_image_type": 4546,
    "fw_manufacturer_id": 4476,
    "fw_type": 2
  },
  {
    "fw_binary_url": "http://fw.ota.homesmart.ikea.net/Tradfri_OTA_release_signed_2019_04_05_111559/bin/10032198-2.1-TRADFRI-gateway-1.8.25.p.elf.sig.ota.signed",
    "fw_filesize": 698748,
    "fw_hotfix_version": 25,
    "fw_major_version": 1,
    "fw_minor_version": 8,
    "fw_req_hotfix_version": 26,
    "fw_req_major_version": 9,
    "fw_req_minor_version": 9,
    "fw_type": 0,
    "fw_update_prio": 5,
    "fw_weblink_relnote": "https://ww8.ikea.com/ikeahomesmart/releasenotes/releasenotesltd.html"
  }
]

Only updates are:

  • 10005777-4.1-TRADFRI-control-outlet-2.0.022.ota.ota.signed
  • 10005778-4.1-TRADFRI-onoff-shortcut-control-2.0.020.ota.ota.signed
  • 10032198-2.1-TRADFRI-gateway-1.8.25.p.elf.sig.ota.signed
  • 10038562-TRADFRI-sy5882-bulb-ws-2.0.022.ota.ota.signed
  • 159699-2.1-TRADFRI-remote-control-1.2.223.ota.ota.signed

So mostly focused on the outlets/gateway.
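For anyone comparing these entries with the version shown for a bulb: the `fw_file_version_MSB`/`fw_file_version_LSB` pairs in the listing above seem to combine into one 32-bit file version whose hex digits spell out the version in the file name. A minimal sketch, assuming that layout (it is inferred from the entries above, not from any IKEA documentation; `decode_file_version` is just a name for illustration):

```python
def decode_file_version(msb: int, lsb: int) -> str:
    """Combine the two 16-bit halves and read the version from the hex digits.

    Assumed layout (inferred from the listing, not documented):
    first hex digit = major, second = minor, next three = build.
    """
    version = (msb << 16) | lsb
    major = (version >> 28) & 0xF
    minor = (version >> 24) & 0xF
    build = (version >> 12) & 0xFFF
    # The build digits are printed as they appear in hex, e.g. 0x221 -> "221".
    return f"{major}.{minor}.{build:03x}"


# Values copied from the listing above.
print(decode_file_version(4642, 5490))   # bulb-ws-gu10 entry
print(decode_file_version(8194, 9763))   # control-outlet entry
print(decode_file_version(4864, 38258))  # bulb-cws entry
```

The results match the versions in the corresponding file names (1.2.221, 2.0.022 and 1.3.009), which is the only evidence for the layout.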

Today one of my hue bulbs showed the exact same issue. Had to powercycle it to become responsive again.

Any news from those who updated deCONZ and the ConBee firmware?

Until now, 1 or 2 GU10s go offline every 2-4 weeks. It is sporadic, but power cycling is annoying because I do not have a power switch in a few places.

Just updated the firmware to 0x26330500 so let's see.

Unfortunately, today one of my GU10's is unavailable..
Using deCONZ 2.05.64 with firmware 26330500 on the Conbee.

I think it was concluded that this is a bulb FW problem, as we see the exact same problem on a Philips Hue Bridge. I noticed a mention of a CT v2 bulb in the release notes section of the Trådfri app. Maybe it is even bulb HW related?

I have had no issues with 2.05.65 so far.

Same, I haven't faced this issue for weeks.

I had the same experience with the latest version of deCONZ until now. I switched some lamps around a few days ago, and I just had to manually switch one of these lamps off and on again for it to become responsive again.

A few GU10's and a E27 bulb went offline a few days ago. Only power cycling brings them back online.

Maybe relevant: this happened during my holiday, when no lights were switched for about 10 days.

Firmware 26330500 with deCONZ 2.05.64

How is this issue now for those who reported a couple of weeks ago that they had had no problems for a few weeks?

I still have this issue. Sometimes none of them goes offline, and out of the blue a GU10 is offline, and a few days later another one..

As far as I can see, there is still no new firmware available for the Tradfri GU10.

I still have this issue as well. For me, group commands still work most of the time even when I can't control the light as a single device, so I use them in my wall-switch scripts as a workaround.

Group commands work for me too, but only a few times. After that, the light stays on or off and doesn't respond anymore. Not even when using a bound Hue Dimmer.

Same for me. I get the impression that the issue is delayed, more than solved.
I have also worked around the issue with using group commands.

I do notice by the way that some lights seem much more susceptible to this issue. Do you recognize this?

@djwlindenaar Hard to say, but I guess half of the 25+ GU10s I have never had this problem, and I'm sure a few had the problem multiple times.

@peer69 @djwlindenaar Does the group command keep working? Today a GU10 went offline and did not even respond to a group command the first time (group command from deCONZ and from a bound Hue Dimmer).

The group command usually keeps working, but I do remember an occasion where I had to power cycle a light. Also I remember one time when it did not respond for a while and without any action started to respond again later.

Every time, sooner or later a group command also failed and I had to power cycle the light.

What do you mean by a while? A few hours or a few days?

Not sure, probably hours, maybe a day. I just forgot about it for some time, but noticed it would be back on the next occasion I used the light.
It was related to a rule triggered by motion, so...

Waited for about 2 days when a GU10 went offline earlier this week, but it did not come back. A powercycle was needed, the GU10 was not warm at all.

Now I have one GU10 light which went offline but doesn't come back online after a power cycle. Pressing 0 in VNC does not solve this issue either.

A group command or a bound Hue Dimmer Switch is also unable to turn the light on/off anymore.

Since last night: I have 4 IKEA lights in the hallway which are switched off after some time by a timer (Home Assistant). Now 1 of the 4 lights stays on and can't be turned off by Home Assistant/deCONZ. It's offline according to deCONZ and hasn't rejoined in 7 hours. I will try to power cycle it and see if that does the trick.

Light not working: TRÅDFRI bulb E27 opal 1000lm with version 1.2.214
Lights working: TRÅDFRI bulb E27 opal 1000lm with version 1.2.214
Firmware: 26330500
Deconz: V2_05_69

I believe there is a hardware issue with the IKEA Tradfri GU10 white spectrum bulbs. In my setup too (IKEA gateway connected via local API to Home Assistant), at least half of my 17 bulbs go offline every couple of days. The only solution is to power cycle the bulbs.

This might be the reason why IKEA has launched a new hardware version that runs a different version of the firmware (2.* instead of 1.*).

As far as I know, there is no new firmware update for the old version, which might be a sign of something hardware related.

Not sure what is the warranty, but I am planning to ask for replacement with the new version.

Not sure if it is related to only one type, because I am using the warm white GU10, not the white spectrum.

Had same issue with two IKEA GU 10 WW today. Power cycling did not help. It did help to powercycle and restart deconz. Maybe some neighbor table issue? Both lights are in a group of two and always controlled as a group.

What a coincidence that this is happening at several installations these days... Or is something else happening?

@peer69

Which versions of deCONZ and ConBee firmware are you using?

I am using:
deCONZ V2_05_66
Firmware 26330500

Same firmware here. deCONZ Version 2.05.69.

And another Tradfri GU10 warm white lost its connection.

image

There are newer firmware files in the testing branch, but I don't know if these fix the issues or cause new problems, perhaps even damage.

http://fw.test.ota.homesmart.ikea.net/feed/version_info.json

  • 10046695-1.1-TRADFRI-light-unified-w-2.1.022.ota.ota.signed
  • 10038562-TRADFRI-sy5882-bulb-ws-2.0.022.ota.ota.signed
  • 10038562-TRADFRI-sy5882-bulb-ws-2.0.023.ota.ota.signed
  • 10035514-TRADFRI-bulb-ws-2.3.007.ota.ota.signed
  • 10035534-TRADFRI-bulb-ws-gu10-2.3.007.ota.ota.signed
  • 159695-TRADFRI-bulb-ws-1000lm-2.3.007.ota.ota.signed

@manup These are for white spectrum GU10 lights and, I guess, for the new version? https://www.ikea.com/nl/nl/p/tradfri-led-lamp-gu10-400-lumen-draadloos-dimbaar-warm-wit-60420041/

Perhaps the 10046695-1.1-TRADFRI-light-unified-w-2.1.022.ota.ota.signed can be used for them. I'll test it on one of my IKEA E27 dimmable lights; hopefully the worst case is a crash, not a burn :)

I've tried to assign the file manually but the bulb doesn't seem to pick up the update (it doesn't start).

Just a 'me too' on this. My entire house is Tradfri and after rebooting my server today I've lost the entire network for the 3rd time in as many days. Power-cycling hasn't made a difference. I've only got Tradfri in my house so I don't have anything else to compare to. I've a mixture of e27 white, e27 colour and GU10's. Only resolution I've found is to hard reset the bulbs and start the whole thing from scratch - sadly this isn't sustainable.

image

Conbee Firmware: 264A0700
Light firmware: 1.2.214

Happy to help with any diagnostics but I need to put most of my house back to Tradfri gateway so my family can turn lights on and off. I'll leave a couple of lamps available.

I just realized a, maybe, interesting addition to this issue.

Several rooms are equipped with multiple Tradfri GU10 spots and most of the rooms with a Hue Dimmer Switch, with a direct binding to the GU10's in the room created by adding them to the Hue Dimmer Switch group in the Phoscon app.

  • Only GU10's with a direct binding with a Hue Dimmer Switch go offline every now and then and need a power cycle to come back online.
  • GU10's without a direct binding with a Hue Dimmer Switch do not go offline.

Doesn't seem to be the case for me. Until now only Hue bulbs seem to be pretty resilient. All kinds of IKEA devices (GU10, E14, FLOALT panel) and Osram devices (bulbs, Lightstrips and Garden Poles) I own have failed. Osram devices especially fail completely; a power cycle does nothing. They have to be removed and re-paired, which is a pain for these devices.

I spent some days thinking about replacing all my zigbee devices with devices for my way too expensive KNX installation after several devices failed and migration to a Conbee II stick also failed, probably because of a somewhat corrupted zll-db. I then took the radical approach of setting up everything from scratch. I reset the gateway and re-paired all devices. As for IKEA devices, this was way easier than it used to be. In the past I had to move them close to the gateway, reset them several times, and use touchlink to make them respond. None of that this time: I just reset the bulbs and they paired without any issues. Still, doing this for 70 devices was a pain, and some Xiaomi sensors made it a bit more difficult than the IKEA stuff. But it already makes me think I might have a more stable network now. Nothing has failed yet, but I will tell you if it does.

It has been a while since I updated my RaspBee. My old version had issues with reading the lights' temperature, so I decided to update to the latest firmware and latest deCONZ. I have only 2 lights (OSRAM 73674). One light (probably random) loses connection after a few minutes with the latest deCONZ and I have to manually turn it off and on again. The only version I could find that was able to read temperatures and didn't lose lights was deconz-2.04.40-qt5.deb, which is super old but works for my basic use.

Happens a lot more lately for me too with IKEA GU10 single color. Kind of a pain in the butt. However the nodes do not become red for me, only greyed out. They will react to remote button presses, but not via Phoscon or HA.

Does deCONZ still update the light state when you control them from the remote?

This typically happens when deCONZ has lost the route to the lamp. The lamp is still on the network, can reach the gateway, and receives broadcasts. However, the gateway cannot reach the light over unicast. As a workaround, you might use /groups commands instead of /lights, so deCONZ sends broadcasts instead.
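To illustrate the difference between the two kinds of request, here is a minimal sketch using the deCONZ REST API paths `/lights/<id>/state` and `/groups/<id>/action`. The host, API key and ids are placeholders, and `build_switch_request` is a hypothetical helper name; the request is only built here, not sent.

```python
import json
import urllib.request


def build_switch_request(host, api_key, resource, resource_id, on):
    """Build (but do not send) a PUT request that switches a light or a group."""
    if resource == "lights":
        # Addressed to a single light (unicast on the ZigBee side).
        path = f"/api/{api_key}/lights/{resource_id}/state"
    elif resource == "groups":
        # Addressed to a group, the workaround suggested above.
        path = f"/api/{api_key}/groups/{resource_id}/action"
    else:
        raise ValueError("resource must be 'lights' or 'groups'")
    body = json.dumps({"on": on}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}{path}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host, key and group id; replace with your own values.
req = build_switch_request("192.0.2.10", "YOUR_API_KEY", "groups", 3, False)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=5)
```

A rule or script that currently targets `/lights/<id>/state` would only need the path changed to the group the light belongs to.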

I'm still having the same problem occasionally (~once a week) with my Xiaomi Curtain Controller (which is a ZigBee router). The _Device Announcement_ after power cycling the device causes deCONZ to relearn the route.

These routing issues were introduced with the support for the routing discovery used by Trådfri lights, so 2.04.40 sounds about right.

@ebaauw When a light becomes unavailable, group commands work for me, but only for a limited time. After that, the light stays on or off and doesn't respond to group commands anymore. I am using a bound Hue Dimmer Switch (binding created by adding the lights to the group that is created by default when a Hue Dimmer Switch is paired in the Phoscon app).

Does deCONZ still update the light state when you control them from the remote?

This typically happens when deCONZ has lost the route to the lamp. The lamp is still on the network, can reach the gateway, and receives broadcasts. However, the gateway cannot reach the light over unicast. As a workaround, you might use /groups commands instead of /lights, so deCONZ sends broadcasts instead.

I will check these suggestions next time. But yes, they usually come back if I cut the power for a couple of minutes. Btw I use the IKEA "hockey puck" remote.

I'm still having the same problem occasionally (~once a week) with my Xiaomi Curtain Controller (which is a ZigBee router). The _Device Announcement_ after power cycling the device causes deCONZ to relearn the route.

These routing issues were introduced with the support for the routing discovery used by Trådfri lights, so 2.04.40 sounds about right.

Hmm, I don't think it has been this bad for me for that long. I have kept it up to date continuously.

Hmm, I don't think it has been this bad for me for that long.

It's an intermittent problem. I haven't been able to reproduce it at will, see #849. In fact, all my rules to control lights are based on /groups actions, so I never noticed the issue before, until I got the curtain controller.

When a light becomes unavailable, group commands work for me, but only for a limited time. After that, the light stays on or off and doesn't respond to group commands anymore.

I would theorise that the light decides to leave the network when it doesn't receive ACKs from the gateway for its attribute reports for an extended time.

I would theorise that the light decides to leave the network when it doesn't receive ACKs from the gateway for its attribute reports for an extended time.

For me, this also happens with lights that were turned on or off only a few minutes/hours before.

As posted a few replies above, most of my GU10 lights are bound to a Hue Dimmer Switch (3-8 in a room, each room with its own Hue Dimmer Switch). A few are not, and those have been online for months. None of them has ever become unavailable. Coincidence, or not? What do you think about this @ebaauw?

most of my GU10 lights are bound to a Hue Dimmer Switch

That seems unlikely. Did you create a binding in the _Bind Dropbox_ panel in the deCONZ GUI from the dimmer switch to the light?

Note that "binding" in ZigBee terms means creating an entry in the device's binding table that specifies the ZigBee address to which commands from the bound cluster are sent. I think the (client!) _OnOff_ and _Level Control_ clusters of your dimmer switch are bound to a group (by the deCONZ REST API plugin). This group is listed in the ZHASwitch resource under config.group. Next, the light has been added to this group. ZigBee groups are more like multicast addresses, so it's probably more correct to say that the light has subscribed to messages to this group.
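To check which group a switch is bound to, one could read config.group from the ZHASwitch resources returned by the REST API's sensors listing. A minimal sketch, assuming the listing has already been fetched and parsed into a dict; the sample data below is made up for illustration and `switch_groups` is a hypothetical helper name.

```python
def switch_groups(sensors):
    """Map each ZHASwitch resource name to the group listed in its config.group."""
    result = {}
    for sensor_id, sensor in sensors.items():
        if sensor.get("type") != "ZHASwitch":
            continue  # skip motion sensors, daylight sensor, etc.
        group = sensor.get("config", {}).get("group")
        if group is not None:
            # Fall back to the resource id when the sensor has no name.
            result[sensor.get("name", sensor_id)] = str(group)
    return result


# Hypothetical sample, shaped like a parsed sensors listing.
sample = {
    "1": {"name": "Daylight", "type": "Daylight", "config": {"on": True}},
    "5": {"name": "Dimmer Kitchen", "type": "ZHASwitch",
          "config": {"group": "20012", "on": True}},
}
print(switch_groups(sample))
```

Lights that are members of the reported group are the ones the switch controls directly, without the gateway being involved.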

In theory, I suppose, the light firmware could have a bug, causing it to hang while processing commands received through a group. I don't think this is likely, though. I'd look at how close (how many hops over the ZigBee network) the lights are to the RaspBee/ConBee and/or whether all lights have the same hardware and firmware revision. IKEA seem to be rolling out new ZigBee 3.0 (ZHA) versions of all their devices.

No, I did not create a binding in the _Bind Dropbox_ panel in the deCONZ GUI.

Do you mean adding lights to these groups in the Phoscon app does not create a binding between remote and light(s)? I assumed so, because switching the added lights with the corresponding Hue Dimmer Switch is also possible when my deCONZ docker container or entire NUC is offline.

image

OK, I have an unresponsive bulb now. It was greyed out in both deCONZ and Phoscon.

It did react to remote commands and also when I moved the slider in "group" mode in Phoscon.

Does deCONZ still update the light state when you control them from the remote?

Not at first, since it was greyed out.
image

I then tried with "reset the selected node" in deCONZ, and then it was for a few minutes not greyed out anymore even though it did not have the cluster radio button anymore in deCONZ.
image

Still only responded to remote and group commands, but now the light state was updated.
image

After a while it greyed out again in Phoscon.

I then tried cutting the power for a few minutes. Didn't help.

An interesting situation occurred tonight:

One of the Tradfri warm white GU10 lights has been offline again for a few hours. The Hue Dimmer Switch, which is bound to 8 GU10s including the one that is now offline, turns on/off the GU10 that is offline according to deCONZ. Strangely enough, only that one.
The other GU10s don't respond to the bound Hue Dimmer Switch.

Switching the group with the REST API will turn on/off all the lights, including the offline GU10.

When a GU10 went offline before, the bound Hue Dimmer Switch switched all the lights in the group on/off, including the GU10 that was offline according to deCONZ (sooner or later that GU10 also stopped listening to group commands).

Could this be useful?

Edit: The 'Name' of this offline GU10 in deCONZ via VNC is empty.

image

When filling this in again, it is not set (never did this here, normally in the Phoscon App):

image

Still yellow as well:

image

Pressing '0' or 'Reset selected node' does not bring the GU10 back online.

Edit 2: About 24 hours after the GU10 went offline, it no longer responds to the bound Hue Dimmer Switch as described before. None of the GU10 lights in the group are responding now. The node in the deCONZ GUI is still yellow but the name is in grey.
After power cycling the offline GU10 light, all 8 lights in the group, including the one which was offline, are responding to the bound Hue Dimmer Switch again.

@manup This behaviour/situation is quite different than before; maybe this can help to find the cause?

I just realized a, maybe, interesting addition to this issue.

Several rooms are equipped with multiple Tradfri GU10 spots and most of the rooms with a Hue Dimmer Switch, with a direct binding to the GU10's in the room created by adding them to the Hue Dimmer Switch group in the Phoscon app.

  • Only GU10's with a direct binding with a Hue Dimmer Switch go offline every now and then and need a power cycle to come back online.
  • GU10's without a direct binding with a Hue Dimmer Switch do not go offline.

Unfortunately, one of my Tradfri GU10 warm white lights without a bound Hue Dimmer Switch has been offline since yesterday.

Still wondering why a few GU10 spots (same warm white) are _never_ offline.
Edit: Also, a GU10 which had been online for months has been offline since yesterday.

Bought a Tradfri Hub last weekend and paired all the GU10 spots of one room to it. Let's see what happens in the coming weeks.

Random lights in my system started to become non-responsive after I added four IKEA on/off outlet switches to my system (for the xmas window lights)... The outlets are distributed throughout my house. At least one of the lights in my system needs to be power cycled per day.

Can I provide any type of logs or info to ease the troubleshoot?
image

@tubalainen What do you mean by non-responsive? Do they become responsive again by themselves or do they need a power cycle?

They are listed as "online" (not greyed out) and are part of the mesh according to my picture but when toggled in Home Assistant (or via phoscon) nothing happens. To get them working again I do need to powercycle.

Is there any progress on this? It's getting a bit ridiculous with my GU10 lights going grey :(

I think I saw someone mention the curtain controller or on/off switch. It actually occurs to me that the problems got a lot worse around the time I got the FYRTUR blinds. This was also the first IKEA button I added to the network... I don't know...

I do not have any Tradfri on/off wall switches or IKEA curtains here and do have the problem.

I had this problem during the summer. My GU10s would drop out all the time. I upgraded all my GU10s to the latest "test" software and have been running without problems for almost two months.

Last weekend I added a couple of new E27 bulbs and now several of my GU10s are dropping out all the time again.

Which GU10 bulbs do you have and which firmware did you install?

Bought a Tradfri Hub last weekend and paired all the GU10 spots of one room to it. Let's see what happens in the coming weeks.

When I used the Tradfri gateway I had unresponsive bulbs almost every day (Tradfri GU10 white spectrum, first generation). I switched to the Philips Hue bridge and the problem seemed gone, but after 2 weeks one bulb became unresponsive (as usual, a power cycle solved the issue).

Not sure why and how the Hue bridge works better with the Tradfri bulbs, but it seems to me that the real issue is with the bulbs.

I do not own a single GU10 and still got the issue.

Please tell me how I can be of any assistance with logs etc to try to solve this.
It seems (my guess here) that some nodes "fall asleep" and don't wake up on the first attempt to wake them?

Is this only an issue when there are several node hops in the mesh? For me it is quite consistent when there is more than one hop to the ConBee. Again, just a feeling. Not confirmed.

I have 8 GU10s, about 18 months old. One of them has gone gray once in that time. They are all in the same room as the ConBee stick. I also have some 3-year-old 980lm bulbs. The one farthest from the gateway, which hops via a GU10, goes gray maybe once every fortnight.
@tubalainen might be onto something.

Next to the one going gray, I have an Osram that has never gone AWOL.

Yes @tubalainen might be onto a clue. I have actually been putting my grey GU10s into a lamp socket on a cord I use when adding new lights and when they are in that they have many times just come back to life. But then when I put them back in their place they are out again. It's not 100% consistent that they do come back, but still might be an accomplice thing.

Here is another potential clue. I've never had any lamps going grey ever (in over a year of up-time with 8 IKEA lamps), but then I updated the Home Assistant deconz plugin from 3.8 to 3.9 and in two days I had two GU10 going grey, I restored a backup of the 3.8 version of the plugin and it's been no problems since then. The strange thing is that the 3.8 and 3.9 plugin uses the same deconz version (if the change log is complete). Yet, even in Phoscon the lamps were inaccessible. This is all rather strange, but I assume that the 3.9 version of the plugin makes some kind of request to deconz that makes the IKEA lamps unstable.

I don't use HA and still suffer from the issue. I have bought some new GU10s and now replace every lamp that goes grey more than once. None of the lamps I have replaced so far has gone grey again. Might be a hardware issue, might not, but it seems like we won't have a solution anytime soon.

I got similar issues. Some of the light nodes don't respond to commands and become grey. The strange thing is that if I send group commands, the same light node respond perfectly every time.

Does a group command keep working after, let's say, a few days or more than a week?
In my case, sooner or later the light also stops listening to group commands.

Yes, it seems to work fine over time as well. But I have a theory of what's wrong. The problems occur when light nodes are routed through a TRADFRI bulb E27 WS opal 1000lm. So yesterday I powered off the two I had in my network. Since then everything seems to work perfectly.

ScreenClip  4

I have TRADFRI bulb GU10 WS 400lm, TRADFRI bulb E14 CWS opal 600lm and TRÅDFRI bulb E27 CWS opal 600lm. They don't seem to cause any trouble.

It could be related to the older Trådfri firmware. More recent lights come with newer firmware, but I don’t think IKEA have published an updated firmware for all the first-generation lights. I don’t have any Trådfri GU10 spots, but my CWS bulb did receive a firmware update. My 1000lm dimmable bulb didn’t; I’m not sure about the white spectrum bulb.

My TRADFRI bulb GU10 WS 400lm are on the same firmware, 2.0.022. They don't seem to cause problems (yet :) )

I agree. The first version of Ikea bulbs (on 1.* firmware) had some software/hardware bug that doesn't affect the second iterations (on 2.*).

Sadly, it seems that IKEA no longer supports the original version and won't provide any firmware update. At the same time, it is not possible to return the bulbs and swap them for the new ones (I've tried, in the UK).

I have cws bulbs on 1.3.009. They don't seem to cause problems.

Update:

It seems like other IKEA bulbs are causing trouble as well.

Here are my IKEA bulbs. So far it's only the TRADFRI bulb E27 WS opal 1000lm that is causing problems. They are now disconnected and things seem to work. I'll post an update if I get new errors.

I don't seem to own any 2.x devices.
Some GU10 work fine, some did not. I started replacing the faulty ones with "new" ones I bought a few months later. They seem to run the same firmware though there is a number "1808 -S" printed on the socket while the "newer" ones have "1729-S".
I have replaced 3 bulbs so far and no new issues have occurred for 2 weeks. I had the problem not only with GU10 bulbs but also with a FLOALT panel, which I did not replace.

Interesting! I will take a look at my bulbs to see if there is some kind of correlation.

It seems like one of my nodes is affected by the problem again. I still have some IKEA nodes in the network, so I will do some tests to see if I can draw some conclusions.

It seems that after the mesh has been up for a while, even the other IKEA devices cause trouble and nodes stop responding to commands. Even other IKEA nodes can be affected; to me it seems like the issues start to appear when nodes are routed through an IKEA node.

Are there any activities planned to look into this? I will move all my IKEA devices to my Hue gateway until this is resolved.

nodes stop to respond on commands.

Are you sure this is the case? Have you confirmed this by sniffing the ZigBee traffic? Do the other nodes no longer react to wireless switches that control lights without going through the gateway? Do they no longer react to group commands?
In my experience, the issue is that the gateway no longer knows the route to the other nodes, so unicast commands from the gateway never reach the other nodes. The nodes themselves are responding just fine, they just don't receive anything to respond to.

Even other IKEA nodes can be affected, to me it seems like the issues start to appear then nodes are routed through a IKEA node.

The (not?) routing IKEA node somehow breaks the route from and/or route discovery by the gateway to the other nodes.

Is there any activities planned to look in to this?

You'll have to ask dresden elektronik support. The routing is handled by the RaspBee/ConBee firmware (and the deCONZ core programme?). Nothing we can do about this in the REST API plugin.

Bought a Tradfri Hub last weekend and paired all the GU10 spots of one room to it. Let's see what happens in the coming weeks.

Maybe it is a bit early to say, but 29 days ago I moved all Tradfri GU10s (8x warm white, firmware 1.2.214) from one room to a Tradfri Hub, and none of them has become unresponsive so far.

I would expect so. The problem is not so much caused by devices from a single manufacturer, as by the interaction between devices from different manufacturers, interpreting the ZigBee standard differently. That's why most hubs/gateways/bridges only support their own devices. A more interesting experiment would be to connect the Trådfri spots to a Hue bridge or OSRAM gateway.

Also, eight lights in a single room will probably result in a direct connection between the hub and each light: all single-hop connections. Routing issues appear only with larger networks, with more devices than entries in a neighbour table (typically around 20); and/or with devices physically out of range from the hub.

nodes stop to respond on commands.

Are you sure this is the case? Have you confirmed this by sniffing the ZigBee traffic? Do the other nodes no longer react to wireless switches that control lights without going through the gateway? Do they no longer react to group commands?
In my experience, the issue is that the gateway no longer knows the route to the other nodes, so unicast commands from the gateway never reach the other nodes. The nodes themselves are responding just fine, they just don't receive anything to respond to.

I have no experience or equipment to sniff the traffic. You are probably correct about how the issue behaves. My nodes do respond to group commands all the time. The problem (if I understand it correctly) is that the unicast commands never reach their destination.

As long as I only send group commands everything seems to work quite well.

Even other IKEA nodes can be affected, to me it seems like the issues start to appear then nodes are routed through a IKEA node.

The (not?) routing IKEA node somehow breaks the route from and/or route discovery by the gateway to the other nodes.

Is there any activities planned to look in to this?

You'll have to ask dresden elektronik support. The routing is handled by the RaspBee/ConBee firmware (and the deCONZ core programme?). Nothing we can do about this in the REST API plugin.

Yeah I have sent email to them, but did not get any response besides some checklist.

Maybe it is a bit early to say, but 29 days ago I have moved all Tradfri GU10s (8x warmwhite - 1.2.214) from one room to a Tradfri Hub, and none of them became unresponsive untill now.

I would expect so. The problem is not so much caused by devices from a single manufacturer, as by the interaction between devices from different manufacturers, interpreting the ZigBee standard differently. That's why most hubs/gateways/bridges only support their own devices. A more interesting experiment would be to connect the Trådfri spots a to Hue bridge or OSRAM gateway.

Also, eight lights in a single room will probably result in a direct connection between the hub and each light: all single-hop connections. Routing issues appear only with larger networks, with more devices than entries in a neighbour table (typically around 20); and/or with devices physically out of range from the hub.

I did this to make sure it is not related to firmware 1.2.214 of the old GU10, which is mentioned here a lot as a possible cause (and a possible reason for IKEA making a new version of the GU10).
Maybe it is related to being connected via another router.

I will consider pairing my other GU10 lights (also warm white, 1.2.214) to the Tradfri Hub as well (including those on the 2nd/3rd floor).

Is there any progress on this? It's getting a bit ridiculous with my GU10 lights going grey :(

I think I saw someone mention the curtain controller or on/off switch. It actually occurs to me that the problems got a lot worse around the time I got the FYRTUR blinds. This was also the first IKEA button I added to the network... I dunno...

I have not had the fyrtur on/off switch in the network since my last post and no dropouts in that time. Might be a coincidence ... ?

No, I do not have any Tradfri on/off or Tradfri curtains and have this issue.
https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-562299982

Accidentally closed this issue. Reopening.

A more interesting experiment would be to connect the Trådfri spots a to Hue bridge or OSRAM gateway.

I had the same issue with Trådfri bulbs (first generation) connected to Trådfri hub. The problem completely disappeared when I moved all the bulbs to a Hue bridge.

BTW - I am still convinced that the first gen of Trådfri bulbs are bugged, but the Hue bridge handles the lights differently (I have noticed that sometimes the lights go out of sync for a fraction of a second when connected to Hue - this never happened with Trådfri).

the Hue bridge handles the lights differently (I have noticed that sometime the lights go out of sync for a fraction of second when connected to Hue - never happened with Trådfri).

In the Hue setup, lights are only ever controlled by the Hue bridge: wireless switches and sensors send notifications to the bridge, triggering rules that control the lights. The bridge predicts the light state and updates its cache upfront when sending commands to the lights. It polls the lights periodically to validate/update the cached state.

In the Trådfri setup, lights are controlled directly by commands from wireless switches and sensors. The lights then send a report with updated state to the hub.

When controlling lights connected to a Hue bridge directly from (Trådfri) wireless switches or sensors, you'll see a delay because the Hue bridge doesn't set up attribute reporting on the lights. The more lights in your network, the longer the delay, as the polling cycles through more lights.

When connecting Hue bulbs to a Trådfri hub, and controlling those lights directly from (Trådfri) wireless switches or sensors, the hub state will never be updated, as Hue lights don't do attribute reporting and the Trådfri hub doesn't do polling.

Note that these differences are at application level, well within the control of the REST API plugin. The routing is at a lower level in the ZigBee stack.

All my IKEA router nodes have now been moved to my Hue bridge. I still have the same issue in my network, so I suspect that something else is causing it. And again it affects nodes that are routed and have no direct link to the gateway.

The same problem with the Tradfri lights which are now paired to the Hue bridge?

@ebaauw should the routing issue appear when a light is connected to the bridge directly AND via a router, or only when it is not directly connected to the bridge?

Now all my IKEA router nodes are moved to my hue bridge. I still have the same issue in my network. So I suspect that it's something else that is causing it. And again it affects nodes that are routed and have no direct link to the gateway.

The same problem with the Tradfri lights which are now paired to the Hue bridge?

I have only 3 bulbs paired. But no issues that I have noticed.

@ebaauw should the routing issue appear when a light is connected to the bridge directly AND via a router, or only when it not directly connected to the bridge?

In my network, my experience is that issues with lost commands affect nodes that are routed through other nodes and have no direct link to the gateway. I have the same issues now even when no IKEA bulbs are connected.

I had one node (1) with no direct link to the gateway. It did not respond to unicast commands (or, as ebaauw mentioned, maybe they never reached the node due to routing issues). When I powered down the node (2) it was routed through, node (1) got a direct route to the gateway and started working perfectly.

should the routing issue appear when a light is connected to the bridge directly AND via a router, or only when it not directly connected to the bridge?

There's no such thing as "connections" in ZigBee. Each router keeps a neighbour table of nearby devices, which it can reach directly (at MAC level). The lines in the deCONZ GUI are nothing more than a graphical representation of these neighbour tables (in so far as deCONZ has actually read these).

When a router needs to send a message to a device that's not in its neighbour table, it sends the request to a neighbouring router, so they can forward it. The single message at NWK level is re-transmitted over multiple hops at MAC level. The RaspBee/ConBee (in this perspective) is just another router. Afaik an end device only sends messages to its parent router.

So when a light is in the RaspBee/ConBee's neighbour table, no route discovery is involved, irrespective of whether that light is in the neighbour table of other routers. When it isn't, the RaspBee/ConBee needs to figure out which neighbouring router knows how to reach the light. If the light is multiple hops away, each router on the path to it recursively needs to do the same.

I'm afraid the details are lost on me, but there are multiple ways to do this route discovery. When routers on the path between the gateway and the lamp use different ways, things can go wrong. The gateway might end up sending the message to the wrong neighbour router (one that doesn't know how to reach the light), or the gateway might not know how to reach the light at all, and won't send any message. Unicast messages generally should be answered with an ACK, so the gateway will mark a device as unreachable when it doesn't receive the ACK. Of course, there's no way of knowing whether the original message or the ACK was lost (the remote device needs to know the route to the gateway to send the ACK).

Broadcast messages (used by group commands) don't use any routing; they're simply re-broadcast at MAC level by all routers. While very reliable, they do consume a lot of network bandwidth. Also, they're not answered by an ACK.
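
To make the difference between routed unicast and flooded broadcast concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions: the node names, the `neighbours` and `next_hop` dictionaries and both functions are made up for illustration, and they don't reflect how the RaspBee/ConBee firmware or any real ZigBee stack is implemented. It shows how a single stale routing entry can break unicast delivery while a broadcast still reaches the same light.

```python
from collections import deque

# Hypothetical topology: who can reach whom at MAC level (neighbour tables).
neighbours = {
    "gateway": {"zolder", "hall"},
    "zolder": {"gateway", "hall"},           # its link to garage has dropped out
    "hall": {"gateway", "zolder", "garage"},
    "garage": {"hall"},
}

# Hypothetical routing tables: per node, destination -> next hop.
# The gateway still holds a stale route to garage via zolder.
next_hop = {
    "gateway": {"garage": "zolder"},
    "zolder": {},
    "hall": {},
}

def unicast(src, dst, max_hops=30):
    """Forward hop by hop; return the path taken, or None when delivery fails."""
    path, node = [src], src
    for _ in range(max_hops):
        if dst in neighbours[node]:
            return path + [dst]              # destination is a direct neighbour
        hop = next_hop.get(node, {}).get(dst)
        if hop is None or hop not in neighbours[node]:
            return None                      # no usable route: message is lost
        path.append(hop)
        node = hop
    return None

def broadcast(src):
    """Every node re-broadcasts once; return the set of nodes reached."""
    reached, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for peer in neighbours[node] - reached:
            reached.add(peer)
            queue.append(peer)
    return reached

print(unicast("gateway", "garage"))          # fails on the stale route
print("garage" in broadcast("gateway"))      # flooding still gets there
```

In this toy model the unicast is lost because the gateway keeps using the stale entry, even though a working path through `hall` exists; the broadcast finds that path simply because every node repeats the message.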

In my network my expirince is that issues with lost commands are affecting nodes that are routed through other nodes and not directly with the gateway. I have the same issues now even when no IKEA bulbs are connected.

So for routing issues, you need a network that's large enough for devices not to appear in the RaspBee/ConBee's routing table. Either because the number of devices exceeds the size of the routing table, or because remote devices are physically out of range of the RaspBee/ConBee, and can only be reached through other routers. I suppose multiple hops would make the issue worse.

Again, it's not so much IKEA's fault as it is different manufacturers using different, somewhat incompatible ways to do route discovery. Especially with multi-hop routes, there's only so much that can be worked around in the RaspBee/ConBee firmware.

I had one node(1) with no direct link to the gateway. It did not respond(or as ebaauw mentioned, maybe they never reach the node due to routing issues) to unicast commands. When I powered down the node(2) it was routed trough the node(1) did get a direct route to the gateway and started working perfect.

Again, I'm afraid the details are lost on me, but I can think of multiple scenarios that would cause this behaviour. The RaspBee/ConBee wants to send a message to node2 or node1, doesn't receive an ack, concludes that node2 is unreachable and removes it from its neighbour table. The table now has room for a new entry, which is used for node1.

Again, it's not so much IKEA's fault as it is different manufacturers using different, somewhat incompatible ways to do route discovery. Especially with multi-hop routes, there's only so much that can be worked around in the RaspBee/ConBee firmware.

This sounds to me like the approach of a multi vendor ZigBee platform is doomed by design until a more thorough standard is established and followed. Still I am kind of locked in right now as I don’t want to use multiple gateways which probably wouldn’t work in my environment anyway (e.g. for sensors which can’t be reached directly by the gateway if I remove bulbs which currently act as routers).

This sounds to me like the approach of a multi vendor ZigBee platform is doomed

I surely hope not, but it's definitely challenging. Size does seem to matter here: if your network is small enough you might not run into these routing issues at all. Also, if the (geographical) centre of your network consists of routers by compatible vendors, some less compatible routers on the edge of your network probably won't matter (since they're not routed through to reach other devices).

I did replace my OSRAM, Trådfri and innr bulbs (that were causing issues) by Hue bulbs. I now have 104 nodes: 51 routers and 53 end devices. 2/3 of the routers are Hue lights. The only router becoming unreachable (but still sending reports) is the Xiaomi curtain controller. 20 of the end devices are Hue, 20 Xiaomi, 8 Eurotronic Spirit and 5 IKEA. The Eurotronic Spirit and IKEA Fyrtur are causing me issues regularly (they're kicked out by their parent, but don't find a new parent - again unreachable by the gateway, but still sending reports); the others seem fine. For all devices, power cycling usually remedies the situation, but sometimes I need to open the network when power cycling the device.

I have been considering to split my network, not by vendor, but by floor or street-side vs back-side of my house. This is a lot of work and I'm reluctant to do so, until I have a better understanding what's causing the issues in detail, and what setup might prevent them from happening.

@ebaauw, thinking about what you said in the above post, it would be interesting to consider a kind of preference for the deCONZ routing table. Could we consider 'blacklisting' routers that are not preferable as a first hop? If you can build your network to contain sufficient 'good' routers, the first hop would always be through those and the second hop would reach all the other devices in the network.
That would not be a definitive solution, but it would allow for much bigger networks (geographically and in number of devices) while always going through 'preferred' routers.
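
As a rough illustration of the 'blacklisting' idea, here is a small Python sketch. Everything in it is hypothetical: the function name `pick_first_hop`, the router names and the LQI values are invented, and nothing like this is known to exist in the RaspBee/ConBee firmware. It simply prefers the best-quality candidate that is not on the avoid list, and falls back to an avoided router only when there is no alternative.

```python
def pick_first_hop(candidates, avoid):
    """Choose a first-hop router for a destination.

    candidates: list of (router_name, lqi) pairs that could forward the message.
    avoid: set of router names we would rather not use as a first hop.
    Returns the chosen router name, or None when there are no candidates.
    """
    preferred = [c for c in candidates if c[0] not in avoid]
    pool = preferred or candidates      # fall back rather than give up
    if not pool:
        return None
    return max(pool, key=lambda c: c[1])[0]

# Hypothetical example: the IKEA spot has the better link, but is avoided.
candidates = [("ikea_gu10", 200), ("hue_e27", 120)]
print(pick_first_hop(candidates, avoid={"ikea_gu10"}))
```

The fallback matters for the scenario described above: without it, sensors that can only be reached through an 'avoided' bulb would become unreachable altogether.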

Another thought: since we know which chips are used by IKEA (Silabs EFR32, gecko engine) and we know (from various sources on the net) that they use the zigbee engine provided by Silabs, we could consider two actions.

  1. We analyse the way the silabs engine does routing and from that derive what is causing this issue
  2. We build a custom version of the lamp firmware based on the same silabs engine, fix the issue and figure out a way to OTA install that firmware

Both would require access to the Silabs SDK, which I don't have. Is somebody active here who does? I'm not looking forward to shelling out $500 for their development kit which includes the SDK. Maybe dresden elektronik is willing to do that?

my 2 cts (being rather frustrated with the occasionally lost connections).

Could we consider 'blacklisting' routers that are not preferable as a first hop.

We: no. It would have to be implemented in the RaspBee/ConBee firmware, I suppose. Not a bad idea imho, but I've got no clue how feasible this is, given the size limitations in the device EEPROM and NVRAM. @manup, what do you think of this?

  1. We build a custom version of the lamp firmware based on the same silabs engine, fix the issue and figure out a way to OTA install that firmware

This is way beyond my current knowledge, and, frankly, my ambition - I do this for a hobby. I have been playing around with the XBee to see if I can write a ZigBee application (my ultimate goal would be to smartify my venetian blinds), but I've never looked into the lower levels of the ZigBee stack, where the routing takes place.

We: no. It would have to be implemented in the RaspBee/ConBee firmware, I suppose. Not a bad idea imho, but I've got no clue how feasible this is, given the size limitations in the device EEPROM and NVRAM. @manup, what do you think of this?

I guess I consider @manup part of We ;)

This is way beyond my current knowledge, and, frankly, my ambition - I do this for a hobby. I have been playing around with the XBee to see if I can write a ZigBee application (my ultimate goal would be to smartify my venetian blinds), but I've never looked into the lower levels of the ZigBee stack, where the routing takes place.

I agree, for me also it's a hobby, however I'm kind of hoping someone is more knowledgeable on these details.

I have been doing something similar using CC2530 boards. I am planning to zigbeeify my sprinkler system (I can't accept that I have to walk around the garden to manually switch certain sprinklers on/off :) ). I've been able to compile a Zigbee firmware for my CC2530 board which joins the deCONZ network as an on/off light. The valves I'm planning to use have a latching mechanism, so they require a specific signal to switch; therefore I had to create my own firmware...

We: no. It would have to be implemented in the RaspBee/ConBee firmware, I suppose. Not a bad idea imho, but I've got no clue how feasible this is, given the size limitations in the device EEPROM and NVRAM. @manup, what do you think of this?

The Zigbee stack in the firmware has a parameter to control whether a route shall be used instead of a direct link, based on the link's quality (RSSI/LQI). This should already work pretty well and has been tweaked over the years.

A further path might be a more general approach to control routing tables of all devices.

Note: You can query the routing table of a router by selecting it and pressing the R key; next-hop routes will be shown in blue.

Currently, when the Zigbee network is opened, all routers are allowed to be a parent node since the Mgmt_Permit_Joining_req is sent as a broadcast. However, if we send this request as a unicast to a certain router only, a new device can only join through that one device. This might be useful for Xiaomi end-devices, which often stick to their parent. For other joining devices, like routers and for example Philips Hue end-devices, this won't help much since they are free to select new parents and routes on their own.

The (new) Mgmt_NWK_IEEE_Joining_List_req in the Zigbee R22 specification looks also interesting:

The Mgmt_NWK_IEEE_Joining_List_req command is provided as a mechanism to obtain the list of IEEE addresses that are expected to be joining the network. This allows the local router to filter Enhanced Beacon Requests and only respond to the devices that are joining.

Without beacons, devices would not even try to join a certain router. But I'm not aware how well this request is supported by currently available routers.

A silver bullet might be using Source Routing, where the gateway provides the complete route within each frame, but I've never seen this used by any consumer product or gateway, so I guess it might not be supported (or only poorly) by current routers. Might be worth a try though.
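
To illustrate what "the gateway provides the complete route" could look like, here is a minimal Python sketch. It assumes the gateway already knows every router's neighbour table (for instance from Mgmt_Lqi responses) and computes the shortest path with a breadth-first search; the function name `source_route` and the topology are invented for the example, and this says nothing about how a real stack builds or encodes a source route.

```python
from collections import deque

def source_route(neighbours, src, dst):
    """Shortest path by hop count (BFS) over the known neighbour tables.

    neighbours: dict mapping node -> set of nodes it can reach directly.
    Returns the full path [src, ..., dst], or None if dst is unreachable.
    """
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:      # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for peer in sorted(neighbours.get(node, ())):
            if peer not in parent:
                parent[peer] = node
                queue.append(peer)
    return None

# Hypothetical three-hop topology.
topology = {
    "gateway": {"hall"},
    "hall": {"gateway", "landing"},
    "landing": {"hall", "garage"},
    "garage": {"landing"},
}
path = source_route(topology, "gateway", "garage")
print(path)          # full path, including both endpoints
print(path[1:-1])    # the relay list the gateway would put in the frame
```

With source routing the intermediate routers only have to follow the relay list, so their own (possibly incompatible) route discovery is bypassed; the price is that the gateway must keep an up-to-date picture of the whole mesh.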

Note: You can query routing table of a router by selecting it and press R key, next-hop routes will be shown in blue.

Cool, I didn't know that one. Is there an overview of all the keys/commands supported by the GUI? Is it possible to revert the blue lines before querying the next node? It doesn't seem to work for the RaspBee/ConBee itself?

Is the routing table different from the neighbour table? Sniffing, I only see ZDP _Link Quality Request_ and _Link Quality Response_ messages, reporting the neighbour table. Does this mean that the lines to a node that aren't coloured blue are the "incoming" routes?

@manup , Indeed an interesting visualisation

A silver bullet might be using Source Routing

Well... that would be worth a try, I guess. It seems to me that, in my network, the most affected IKEA lights are those that are at a longer distance from the coordinator. I've been suspecting that something related to routing is causing the issues, as @ebaauw mentioned.
I'd be willing to test something like this for you. Although, I have to admit that the connection loss is not reliably reproducible, so testing might be difficult.

I was more thinking along the lines of the coordinator being selective about which router to select for the first hop. If we can find routers that do play nice with the IKEA devices (maybe those are the IKEA routers), routing all packets through those should make things more robust. Not sure if you can trick the route discovery process to prefer a certain router for the first hop, but probably you can comment on that, @manup

Note: You can query routing table of a router by selecting it and press R key, next-hop routes will be shown in blue.

Cool, I didn't know that one. Is there an overview of all the keys/commands supported by the GUI? Is it possible to revert the blue lines before querying the next node? It doesn't seem to work for the RaspBee/ConBee itself?

If someone wants to list out all the key combos and what they do, I’ll be happy to clean it up and write a wiki page.

Cool, I didn't know that one. Is there an overview of all the keys/commands supported by the GUI?

Request Node Descriptor = 1
Request Power Descriptor = 2
Request Nwk Address = 0
Request Routing Table = R
Request Mgmt Leave = L (with rejoin, don't use with innr lights!)
Request Active Endpoints = 7
Request Simple Descriptors = 8
Refresh = F5 / Cmd-R
Delete = Delete / Backspace
Gateway Device Annce = A (not that useful)

Is it possible to revert the blue lines before querying the next node?

Not yet, it was just a quick and dirty addition to see routes :)
Perhaps adding another key like Shift-R to clear the routes of a node is useful?

It doesn't seem to work for the RaspBee/ConBee itself?

Unfortunately RaspBee I / ConBee I don't support the related ZDP command, it works with ConBee II.

Is the routing table different from the neighbour table? Sniffing, I only see ZDP Link Quality Request and Link Quality Response messages, reporting the neighbour table.

These are two separate tables. Usually the routing table only holds routes to nodes which aren't directly reachable; directly reachable (1-hop) nodes are covered by the neighbor table.

For nodes which are in the neighbor table (and have a strong signal) no route discovery is needed. The neighbor table itself is built mostly using 1-hop NWK Link Status commands.

Does this mean that the lines to a node that aren't coloured blue are the "incoming" routes?

The non-blue lines (green/orange/yellow) just represent the 1-hop neighbor table. The blue lines are outgoing routes to some destination and represent only the next hop (not the complete route). In the current view the destination NWK address isn't shown, but there are likely debug prints.

We: no. It would have to be implemented in the RaspBee/ConBee firmware, I suppose. Not a bad idea imho, but I've got no clue how feasible this is, given the size limitations in the device EEPROM and NVRAM. @manup, what do you think of this?

The Zigbee stack in the firmware has a parameter to control if a route shall be used instead of a direct link based on its quality (RSSI/LQI) this should already work pretty good and was tweaked over the years.

A further path might be a more general approach to control routing tables of all devices.

Note: You can query routing table of a router by selecting it and press R key, next-hop routes will be shown in blue.

Really useful function! Yesterday I powered on my TRÅDFRI bulb GU10 WS 400lm (7 bulbs) with firmware 2.0.022. Today the issues came back, it appeared on a TRÅDFRI bulb E27 CWS opal 600lm with firmware 1.3.009. Thanks to this function I could see that the E27 bulb was routed through the GU10 ones. I powered off all the GU10 and like magic the E27 started to work again.

So it's becoming pretty clear that routing through Innr and IKEA bulbs is causing issues. I have started to move Innr and IKEA bulbs to my Hue gateway. It seems like the Philips bulbs are good routers and are not causing issues.

So, I did a little experiment while repositioning my lights in the living room. I removed two routers and repositioned one light. Et voilà, one light has gone zombie, but not completely. It behaves exactly as I have seen with the IKEA lights which occasionally lose connection.

The IKEA light is listed as unreachable, but still responds to group commands. Also since it is still known by several other routers, it seems they report seeing the light (it's 0x000B57FFFEC52C7D) :

11:49:33:392 ZDP Mgmt_Lqi_rsp zdpSeq: 190 from 0x000B57FFFE9BEBC3 total: 11, startIndex: 6, listCount: 3
11:49:33:392     * neighbor: 0x00158D00017028B0 (0xA3F5), LQI: 71, relation: 0x02 rxOnWHenIdle: 1
11:49:33:392     * neighbor: 0x90FD9FFFFE054F81 (0xC7BC), LQI: 41, relation: 0x02 rxOnWHenIdle: 1
11:49:33:393     * neighbor: 0x000B57FFFEC52C7D (0xD338), LQI: 69, relation: 0x02 rxOnWHenIdle: 1

and

11:49:50:017 Node 0x000B57FFFEC52C7D is known by 7 neighbors, last seen 33 s
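
For anyone who wants to check this in their own logs, here is a small Python sketch that lists which routers report a given light in their `Mgmt_Lqi_rsp`, and with what LQI. The regular expressions are written against the log lines quoted above only; the exact deCONZ log format may differ between versions, and the function name `seen_by` is made up for the example.

```python
import re

# Patterns modelled on the log excerpt above (an assumption, not a documented format).
HEADER = re.compile(r"ZDP Mgmt_Lqi_rsp zdpSeq: \d+ from (0x[0-9A-Fa-f]{16})")
ENTRY = re.compile(r"\* neighbor: (0x[0-9A-Fa-f]{16}) \(0x[0-9A-Fa-f]{4}\), LQI: (\d+)")

def seen_by(lines, target):
    """Return {reporting router: LQI} for every router listing `target` as neighbour."""
    reporter = None
    result = {}
    for line in lines:
        header = HEADER.search(line)
        if header:
            reporter = header.group(1)   # following entries belong to this router
            continue
        entry = ENTRY.search(line)
        if entry and reporter and entry.group(1).lower() == target.lower():
            result[reporter] = int(entry.group(2))
    return result

log = [
    "11:49:33:392 ZDP Mgmt_Lqi_rsp zdpSeq: 190 from 0x000B57FFFE9BEBC3 total: 11, startIndex: 6, listCount: 3",
    "11:49:33:392     * neighbor: 0x00158D00017028B0 (0xA3F5), LQI: 71, relation: 0x02 rxOnWHenIdle: 1",
    "11:49:33:392     * neighbor: 0x90FD9FFFFE054F81 (0xC7BC), LQI: 41, relation: 0x02 rxOnWHenIdle: 1",
    "11:49:33:393     * neighbor: 0x000B57FFFEC52C7D (0xD338), LQI: 69, relation: 0x02 rxOnWHenIdle: 1",
]
print(seen_by(log, "0x000B57FFFEC52C7D"))
```

Running it over a longer log would show whether the "unreachable" light is still in the neighbour tables of routers close to the gateway, or only of routers further away.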

Sometimes, it seems communication is possible:

11:49:44:901 read attributes of 0x000B57FFFEC52C7D cluster: 0x0006: [ 11:49:44:902 0x0000 11:49:44:902 ]
11:49:44:902 add task 6636 type 19 to 0x000B57FFFEC52C7D cluster 0x0006 req.id 213
11:49:44:902 Poll APS request 213 to 0x000B57FFFEC52C7D cluster: 0x0006
11:49:44:904 Idle timer triggered
11:49:44:951 Poll APS confirm 213 status: 0x00
11:49:44:951 Erase task req-id: 213, type: 19 zcl seqno: 89 send time 0, profileId: 0x0104, clusterId: 0x0006

sometimes not:

12:18:51:033 read attributes of 0x000B57FFFEC52C7D cluster: 0x0006: [ 12:18:51:033 0x0000 12:18:51:033 ]
12:18:51:033 add task 4658 type 19 to 0x000B57FFFEC52C7D cluster 0x0006 req.id 119
12:18:51:033 Poll APS request 119 to 0x000B57FFFEC52C7D cluster: 0x0006
---
12:19:02:077 Poll APS confirm 119 status: 0xA7
12:19:02:077     drop item attr/modelid
12:19:02:077     drop item attr/swversion
12:19:02:077     drop item state/bri
12:19:02:077 0x000B57FFFEC52C7D error APSDE-DATA.confirm: 0xA7 on task

There is very little information on what is going wrong. What does the 0xA7 status mean..?

Any further information that might help to figure out what is happening?
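
To see how often this happens, the request/confirm pairs can be pulled out of the log with a short Python script. This is a sketch based only on the log lines shown above: the regular expressions and the function name `aps_results` are assumptions, and the status names in `STATUS_NAMES` are my reading of the Zigbee APS status codes (0xA7 being NO_ACK), not something taken from the deCONZ source.

```python
import re

REQ = re.compile(r"^(\d\d:\d\d:\d\d:\d{3}) Poll APS request (\d+) to (0x[0-9A-Fa-f]{16})")
CONF = re.compile(r"^(\d\d:\d\d:\d\d:\d{3}) Poll APS confirm (\d+) status: (0x[0-9A-Fa-f]{2})")

# Assumed mapping of APS status codes; only the two seen in this thread.
STATUS_NAMES = {0x00: "SUCCESS", 0xA7: "NO_ACK"}

def to_ms(stamp):
    """Convert a 'HH:MM:SS:mmm' log timestamp to milliseconds since midnight."""
    h, m, s, ms = (int(part) for part in stamp.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def aps_results(lines):
    """Pair each APS request with its confirm: (address, status, latency in ms)."""
    pending, results = {}, []
    for line in lines:
        req = REQ.match(line)
        if req:
            pending[req.group(2)] = (req.group(3), to_ms(req.group(1)))
            continue
        conf = CONF.match(line)
        if conf and conf.group(2) in pending:
            address, sent = pending.pop(conf.group(2))
            results.append((address, int(conf.group(3), 16), to_ms(conf.group(1)) - sent))
    return results

log = [
    "11:49:44:902 Poll APS request 213 to 0x000B57FFFEC52C7D cluster: 0x0006",
    "11:49:44:951 Poll APS confirm 213 status: 0x00",
    "12:18:51:033 Poll APS request 119 to 0x000B57FFFEC52C7D cluster: 0x0006",
    "12:19:02:077 Poll APS confirm 119 status: 0xA7",
]
for address, status, latency in aps_results(log):
    print(address, STATUS_NAMES.get(status, hex(status)), latency, "ms")
```

If that reading of 0xA7 is right, the failing request is one where the gateway waited about 11 seconds for an acknowledgement that never arrived, which fits the routing explanation given earlier in the thread.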

This sounds like an exact replication of my issues!

Power cycled the light

12:39:21:833 ZDP device announce: 0x000B57FFFEC52C7D, 0xD338, 0x8E
12:39:21:833 ZDP add fast discover for 0x000b57fffec52c7d
12:39:21:838 DeviceAnnce of LightNode: 0x000b57fffec52c7d Permit Join: 0
12:39:22:040 ZDP finished fast discover for 0x000b57fffec52c7d

And seems to be happily reporting its status

12:39:22:665 read attributes of 0x000B57FFFEC52C7D cluster: 0x0006: [ 12:39:22:665 0x0000 12:39:22:665 ]
12:39:22:665 add task 10024 type 19 to 0x000B57FFFEC52C7D cluster 0x0006 req.id 175
12:39:22:665 Poll APS request 175 to 0x000B57FFFEC52C7D cluster: 0x0006
12:39:22:753 Poll APS confirm 175 status: 0x00
12:39:22:753 Erase task req-id: 175, type: 19 zcl seqno: 167 send time 0, profileId: 0x0104, clusterId: 0x0006
12:39:22:920 Node 0x000B57FFFEC52C7D is known by 9 neighbors, last seen 1 s

then:

12:39:26:760 ZDP active ep request to 0x000b57fffec52c7d
12:39:26:760 ZDP send request id: 0x07 to 0x000b57fffec52c7d
---
12:39:32:520 node 0000B57FFFEC52C7D leave wait state
12:39:32:521 ZDP active ep request to 0x000b57fffec52c7d
12:39:32:521 Incr. ZDP retry count 2 on item 7
---
12:39:44:655 add task 10125 type 21 to 0x000B57FFFEC52C7D cluster 0x0004 req.id 60
12:39:44:700 Erase task req-id: 60, type: 21 zcl seqno: 171 send time 0, profileId: 0x0104, clusterId: 0x0004
---
12:39:45:405 create binding for attribute reporting of cluster 0x0000
12:39:45:405 queue binding task for 0x000B57FFFEC52C7D, cluster 0x0000
12:39:45:405 create binding for attribute reporting of cluster 0x0006
12:39:45:405 queue binding task for 0x000B57FFFEC52C7D, cluster 0x0006
12:39:45:405 create binding for attribute reporting of cluster 0x0008
12:39:45:405 queue binding task for 0x000B57FFFEC52C7D, cluster 0x0008
12:39:45:405 Force binding of attribute reporting for node HK houtlamp 2
---
12:39:50:760 Node 0x000B57FFFEC52C7D is known by 9 neighbors, last seen 28 s
---
12:40:06:904 binding/unbinding timeout srcAddr: B57FFFEC52C7D, retry
12:40:08:905 binding/unbinding timeout srcAddr: B57FFFEC52C7D, retry
---
12:40:11:881 read attributes of 0x000B57FFFEC52C7D cluster: 0x0006: [ 12:40:11:881 0x0000 12:40:11:881 ]
12:40:11:882 add task 10241 type 19 to 0x000B57FFFEC52C7D cluster 0x0006 req.id 205
12:40:11:882 Poll APS request 205 to 0x000B57FFFEC52C7D cluster: 0x0006
---
12:40:21:640 ZDP skip fetch, node 0xB57FFFEC52C7D has unconfirmed requests [1]
---
12:40:22:957 Poll APS confirm 205 status: 0xA7
12:40:22:957     drop item attr/modelid
12:40:22:957     drop item attr/swversion
12:40:22:957     drop item state/bri
12:40:22:957 0x000B57FFFEC52C7D error APSDE-DATA.confirm: 0xA7 on task
12:40:22:957 Erase task req-id: 205, type: 19 zcl seqno: 175 send time 11, profileId: 0x0104, clusterId: 0x0006
12:40:22:957 max transmit errors for node 0x000B57FFFEC52C7D, last seen by neighbors 78 s
---
12:40:28:904 giveup binding srcAddr: B57FFFEC52C7D
---
12:40:30:904 binding/unbinding timeout srcAddr: B57FFFEC52C7D, retry
---
12:40:32:050 giveup binding srcAddr: B57FFFEC52C7D
---
12:40:33:568 ZDP skip fetch, node 0xB57FFFEC52C7D has unconfirmed requests [1]
---
12:40:39:481 ZDP skip fetch, node 0xB57FFFEC52C7D has unconfirmed requests [1]
---
12:40:42:987 max transmit errors for node 0x000B57FFFEC52C7D, last seen by neighbors 10 s
---
12:40:43:276 read attributes of 0x000B57FFFEC52C7D cluster: 0x0006: [ 12:40:43:276 0x0000 12:40:43:276 ]
12:40:43:277 add task 10380 type 19 to 0x000B57FFFEC52C7D cluster 0x0006 req.id 157
12:40:43:277 Poll APS request 157 to 0x000B57FFFEC52C7D cluster: 0x0006
---
12:40:54:621 Poll APS confirm 157 status: 0xA7
12:40:54:621     drop item attr/modelid
12:40:54:621     drop item attr/swversion
12:40:54:621     drop item state/bri
12:40:54:621 0x000B57FFFEC52C7D error APSDE-DATA.confirm: 0xA7 on task
12:40:54:621 Erase task req-id: 157, type: 19 zcl seqno: 180 send time 12, profileId: 0x0104, clusterId: 0x0006
12:40:54:621 max transmit errors for node 0x000B57FFFEC52C7D, last seen by neighbors 22 s

it stops working. Until, 2 minutes later:

12:42:10:529 LightNode removed 0x000b57fffec52c7d
12:42:10:530 Node zombie state changed 0x000b57fffec52c7d
---
12:42:39:361 Node 0x000B57FFFEC52C7D is known by 0 neighbors, last seen 0 s
---
12:43:06:904 add task 11002 type 21 to 0x000B57FFFEC52C7D cluster 0x0004 req.id 83
12:43:06:953 Erase task req-id: 83, type: 21 zcl seqno: 198 send time 0, profileId: 0x0104, clusterId: 0x0004
12:43:07:329 Node 0x000B57FFFEC52C7D is known by 0 neighbors, last seen 22 s
---
12:43:12:455 read attributes of 0x000B57FFFEC52C7D cluster: 0x0006: [ 12:43:12:455 0x0000 12:43:12:455 ]
12:43:12:455 add task 11026 type 19 to 0x000B57FFFEC52C7D cluster 0x0006 req.id 125
12:43:12:455 Poll APS request 125 to 0x000B57FFFEC52C7D cluster: 0x0006
12:43:12:498 ZDP status = 0x00 -> SUCCESS

I guess I need to get myself a zigbee sniffer to figure out what is happening over the air.

After this it stopped working again, somewhat more permanently this time.

What does the 0xA7 status mean..?

See here: https://github.com/dresden-elektronik/deconz-rest-plugin/wiki/Zigbee-Error-Codes-in-the-Log
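As a quick reference while reading logs: in the Zigbee specification, 0xA7 is the APS layer NO_ACK status, meaning no acknowledgement was received for a request that asked for one. The sketch below decodes the confirm lines from the deCONZ log. The table is only a small subset of status values (APS, NWK and MAC layer) that I am sure about, and the helper name is made up for this example.

```python
import re

# Small subset of status values from the Zigbee / IEEE 802.15.4 specifications.
STATUS_NAMES = {
    0x00: "SUCCESS",
    0xA7: "APS NO_ACK",
    0xD0: "NWK ROUTE_DISCOVERY_FAILED",
    0xE1: "MAC CHANNEL_ACCESS_FAILURE",
    0xE9: "MAC NO_ACK",
}

CONFIRM_RE = re.compile(r"Poll APS confirm (\d+) status: (0x[0-9A-Fa-f]{2})")


def decode_confirm(line):
    """Return (request id, status code, status name), or None for other lines."""
    match = CONFIRM_RE.search(line)
    if match is None:
        return None
    code = int(match.group(2), 16)
    return int(match.group(1)), code, STATUS_NAMES.get(code, "not in this table")


print(decode_confirm("12:19:02:077 Poll APS confirm 119 status: 0xA7"))
print(decode_confirm("11:49:44:951 Poll APS confirm 213 status: 0x00"))
```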

Thanks for that! Somehow it is what I was expecting...

I tried to pair my TRÅDFRI bulb GU10 WS 400lm (7 bulbs) with my Philips Hue bridge. Several other bulbs on the network became unavailable, and the TRÅDFRI GU10 WS 400lm bulbs showed the same behaviour as with Deconz: they respond to group commands but not to unicast.

The TRÅDFRI bulb GU10 WS 400lm seems to have some kind of problem. I will do some testing to see whether it is individual bulbs or all of them that are not working.

Now I have done some testing. My conclusion is that it is in fact individual bulbs that cause the network to become unresponsive. The problem appeared almost instantly on the Hue bridge, and things started to work again as soon as I powered those bulbs down. Out of 7 bulbs, 3 are affected. I will report back if I find any new problems.

@MikaelHoogen, which version of the bulbs and firmware do you have?

I have seen this problem on v1 E27 WS 980lm, GU10 WS 400lm, E27 CWS 600lm, E15 WS and all 3 FLOALT panels on the Ikea, HUE and DeCONZ gateways over the last year and a half.
I'm convinced this is a HW issue. It is totally random which node dies.
Here in Norway all electronics are under 2 years warranty. Will try to open a ticket and hear what they say.
Regarding the FLOALT. A few months back, Ikea was running a clearance sale on them. Perhaps that was to get rid of all v1?
Has anybody seen a FLOALT v2 out there? (with the sy5882 driver perhaps?)

That is also what I have been trying to say for quite some time. I have issues only with Trådfri v1 bulbs. The Philips Hue bridge helps but still nodes become unreachable from time to time.

A couple of days ago I decided to replace all 12 IKEA GU10 bulbs with the new version (warm white).
That made this issue even worse. Now I am also losing Hue GU10 white ambiance and Hue E14 color.

I've been using deConz and IKEA lights since they first came out. I have about 70 nodes in my network, most of them IKEA lights. I also have a few Philips Hue lights (4) and one Osram. For sensors, I have a few IKEA motion sensors and many Xiaomi motion sensors.

I have been running this setup for quite a while, and not a day goes by without one light or another getting stuck, so that I need to power cycle it. My wife is not thrilled, and neither am I. I am very close to ditching all Zigbee devices and going back to plain old-school lights. It just doesn't work. Very disappointing. I wonder whether others see the same problems with pure Philips or Osram setups, or is it just IKEA that's crap?

Looking at all the people in this thread who have the exact same problems, and given that the makers of deconz cannot figure out what the problem with the IKEA lights is, it looks and smells like an IKEA firmware problem.

Many users are having similar issues even with the IKEA hub. So I guess this is not a unique problem related to deconz. However, if someone could fix this, "patch it/find a workaround" it would be the guys here! <3

I have exactly the same issue both with the family and with the setup.

It's sad :( especially since even the new ones they sell today still have these problems. I just don't understand. Mine are from 2017, so you would expect them to have fixed the issue by now. I will have to consider removing the whole thing. It can't be fixed here: @manup has said before that this is an issue in the firmware of the device, so only IKEA can fix it, and since they haven't done that yet, there is no hope.

Deconz version 2.05.69
Conbee II firmware 264A0700

It's not 100% perfect, and the dropouts I noticed are E27 Ikea (Rev1) bulbs.
When I had these Ikea lights on my Philips Hue bridge they were rock solid, so I will probably move them back to Hue and keep this Zigbee mesh purely CC2531 (router firmware) and Xiaomi sensors.
I also have an Innr bulb (E14) and an SP120, but these are still OK with Deconz.

I'm using Home Assistant and Deconz to control about 20 Ikea bulbs (and a host of other sensors). The bulbs are primarily GU10s which are grouped by 3 spots to one light.

I've been using this setup for more than 15 months, and since this summer I've had problems with one or more individual bulbs getting stuck and needing a power cycle, or even needing to be removed from and re-added to the Zigbee network. These were always GU10s that were defined in a light group in HASS.

A few weeks ago I created groups for my lights in Phoscon, and started to reference these groups in HASS.

After that I've not had a single problem with bulbs being stuck and needing a power cycle. The only problem I see is that when I turn off all my indoor lights at once, sometimes an entire Phoscon/Deconz group remains on.

To me this seems like a problem with the Deconz API and possibly handling multiple bulbs in rapid succession over it.
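To make the difference concrete, here is a minimal sketch of the two ways to switch three spots off through the deCONZ REST API: one request to the group's `action` endpoint, or one request per light to its `state` endpoint. The host, API key and the group/light ids are placeholders, and the requests are only built, not sent. As far as I understand, the group request ends up as a single Zigbee group command, while the per-light requests each become a separate unicast.

```python
import json
import urllib.request


def build_put(host, api_key, resource, resource_id, body):
    """Build (but do not send) a PUT request for a light state or group action."""
    suffix = "action" if resource == "groups" else "state"
    url = f"http://{host}/api/{api_key}/{resource}/{resource_id}/{suffix}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


HOST = "192.0.2.10"       # placeholder address
API_KEY = "YOUR_API_KEY"  # placeholder key

# One request for the whole group (group id 3 is hypothetical).
group_request = build_put(HOST, API_KEY, "groups", 3, {"on": False})

# One request per bulb (light ids 11, 12 and 13 are hypothetical).
light_requests = [
    build_put(HOST, API_KEY, "lights", light_id, {"on": False})
    for light_id in (11, 12, 13)
]

print(group_request.get_method(), group_request.full_url)
for request in light_requests:
    print(request.get_method(), request.full_url)

# To actually send a request: urllib.request.urlopen(group_request)
```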

I also have issues with Ikea devices (bulbs and lately also the Symfonisk remote) becoming unavailable and requiring re-pairing with deCONZ. I've had these issues for 6 months or so, I believe. Not much to add, other than that I think it has been happening more frequently lately. I don't have any GU10 bulbs, only various E27 and E14 bulbs (and various remotes). It seems that certain devices are more prone to dropping out (maybe depending on how they mesh or similar). I have approx. 20 devices within a max range of 5 m from the Conbee I, with the latest FW/Phoscon/deCONZ.

Try Deconz version 2.05.69. Pretty stable for me with Ikea, Innr and Xiaomi.

I have a similar issue and an open ticket: https://github.com/dresden-elektronik/deconz-rest-plugin/issues/2256#issue-543476970

I would say that downgrading to version 2.05.67 has made my network more stable than ever. A few bulbs still went into an unresponsive state, but it was always the same two, and I could live with those. Yesterday everything worked as expected.

It looks to be an issue involving a mix of all possible parts, which of course makes it hard to find. However, the deconz version looks to be a major factor in whether the setup is stable or not.

Try Deconz version 2.05.69. Pretty stable to me with Ikea, Innr and Xiaomi

Will try that as well as the

A few weeks ago I created groups for my lights in Phoscon, and started to reference these groups in HASS.

So I finally got my sniffing hardware and I've also hacked Wireshark a little to show names for all the ieee802.15.4 addresses (which makes it a lot easier to read what's going on).

Anyway, my knowledge of the zigbee stuff is not very strong, so possibly I'm asking stupid questions. However I've been going through some sniffing logs and hopefully we can see something that explains the flaky behavior of the ikea lights.

See the log below. Does that seem like 'normal' behavior? Both lights involved are Ikea lights. First DeConz sends a request to 'Garage', routing through 'Zolder'. Then 'Zolder' tries to transmit it to 'Garage', but that seems to fail (why?) and it is retried 20 times in very quick succession (most are within 3 ms). Finally 'Zolder' gives up and sends a link failure back to DeConz.

Should it be this aggressive in retrying transmission?

Also: I notice that DeConz happily keeps sending requests to Garage through Zolder, although the link is clearly failing. DeConz even does this when Garage is no longer in the Link Status report of Zolder. And when Garage does appear in the Link Status report of Zolder, it is at cost 7 both incoming and outgoing. Actually, the route from Garage to DeConz is not through Zolder...
Why does DeConz keep routing through Zolder even after a Link Failure message?

No.          Time         Source       Transmit Dev Receive Dev  Destination  Protocol     Info         
7580         748.356336   DeConz       DeConz       Zolder       Garage       ZigBee ZDP   Link Quality Request
7582         748.360218   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7583         748.363417   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7584         748.368857   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7585         748.373337   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7587         748.419739   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7588         748.424219   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7589         748.429019   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7590         748.434460   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7591         748.481181   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7592         748.485021   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7593         748.488541   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7594         748.492702   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7595         748.541983   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7596         748.545824   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7597         748.550944   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7598         748.556384   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7599         748.594786   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7600         748.597986   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7601         748.602787   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7602         748.608226   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7603         748.640867   Zolder       Zolder       DeConz       DeConz       ZigBee       Network Status, 0x9e0c: Non-tree Link Failure
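If you want to count the forwarding attempts in your own capture instead of eyeballing them, here is a minimal sketch. It assumes a plain-text export with the same eight columns as above, separated by at least two spaces; only an excerpt of the capture is embedded, and the helper names are made up for this example.

```python
import re

# Excerpt of the capture above: the first forwarding attempts plus the failure report.
CAPTURE = """\
7580         748.356336   DeConz       DeConz       Zolder       Garage       ZigBee ZDP   Link Quality Request
7582         748.360218   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7583         748.363417   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7584         748.368857   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7585         748.373337   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7587         748.419739   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7603         748.640867   Zolder       Zolder       DeConz       DeConz       ZigBee       Network Status, 0x9e0c: Non-tree Link Failure
"""

FIELDS = ("no", "time", "source", "tx", "rx", "destination", "protocol", "info")


def parse_rows(text):
    """Split each capture line into its eight columns."""
    rows = []
    for line in text.splitlines():
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=len(FIELDS) - 1)
        if len(parts) != len(FIELDS) or not parts[0].isdigit():
            continue  # skip the header line and anything malformed
        row = dict(zip(FIELDS, parts))
        row["time"] = float(row["time"])
        rows.append(row)
    return rows


def forwarding_attempts(rows, relay, next_hop):
    """Timestamps of frames that `relay` transmitted towards `next_hop`."""
    return [r["time"] for r in rows if r["tx"] == relay and r["rx"] == next_hop]


def gaps_ms(times):
    """Gaps between consecutive attempts, in milliseconds."""
    return [round((b - a) * 1000, 1) for a, b in zip(times, times[1:])]


rows = parse_rows(CAPTURE)
attempts = forwarding_attempts(rows, "Zolder", "Garage")
failures = [r for r in rows if "Link Failure" in r["info"]]

print("attempts:", len(attempts))
print("gaps (ms):", gaps_ms(attempts))
print("link failure reports:", len(failures))
```

Run on the full table, the same function counts the 20 attempts by Zolder; the gaps show short bursts of attempts a few milliseconds apart, separated by pauses of several tens of milliseconds.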

Ok, so I've spent some time reading the Silabs ZigBee documentation (the IKEA bulbs are based on Silabs chips).
Apparently the retry behavior is exactly as described in the documents.

That leaves the question why DeConz insists on using this route. The link cost is high, there is a reply of Link Failure and still that route is used. Why..?
(and better routes are available...)

Very useful insights @djwlindenaar, although I'm not able to assist you at this level. This somewhat confirms my suspicions. I had a decently working network (after a software downgrade) until an accidental power cut blacked out the whole house.

Since then I'm back where I started, and I think I've got bad routes again.

One other thing: how come deconz sets the state even if the actual device does not respond? A bulb is shown as on in the interfaces while physically it is off (but still powered). Sometimes turning it on again via software works, sometimes a power cycle works, but often it does not respond at all, even after a power cycle. The same bulb can suddenly respond again after a while.

Some more interesting behavior (I show the same lights, but this is happening in other places in my network as well):

  • DeConz sends many-to-one Route Request packets. As a result, every device first does a route discovery. This is supposed to feed the concentrator (coordinator) with information on how to reach that device. It seems DeConz ignores this information for its routing...

No.         Time        Source      Tran Dev    Recv Dev    Destination Protocol    Info
7544        747.873     DeConz      DeConz      Zolder      Garage      ZigBee ZDP  Link Quality Request
7546        747.876     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7547        747.882     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7548        747.887     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7549        747.892     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7550        747.919     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7551        747.925     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7552        747.929     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7553        747.932     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7554        747.980     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7555        747.985     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7556        747.990     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7557        747.995     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7558        748.046     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7559        748.052     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7560        748.055     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7561        748.061     DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7563        748.066     Garage      Garage      Kerstver    DeConz      ZigBee      Route Record, Dst: 0x0000
7565        748.076     Garage      Kerstver    HK zoutlamp DeConz      ZigBee      Route Record, Dst: 0x0000
7567        748.084     Garage      HK zoutlamp DeConz      DeConz      ZigBee      Route Record, Dst: 0x0000

That last packet has information on how to reach the Garage:

Command Frame: Route Record
    Command Identifier: Route Record (0x05)
    Relay Count: 2
    Relay Device 1: 0x731e[Kerstverli]
    Relay Device 2: 0xa3f5[HK zoutlam]

But next request to Garage is again sent through the Zolder

No.         Time        Source      Tran Dev    Recv Dev    Destination Protocol    Info
7569        748.112847  DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7570        748.117327  DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request
7573        748.126608  DeConz      Zolder      Garage      Garage      ZigBee ZDP  Link Quality Request

I would not be surprised that this behavior has something to do with the flaky IKEA light behavior.

@manup , could you comment on this? Your statement that Source Routing would help, does make a lot of sense.
Alternatively, I expect that using the information from the Route Record to update the routing table of DeConz might be already a huge improvement...
Or we need to figure out why DeConz wants to keep using a route which reports link issues / has high routing costs compared to an alternative.
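To illustrate what I mean by using the Route Record: the relay list is filled in hop by hop on the way from the device to the concentrator, so reading it backwards gives the hops from the concentrator to the device. The sketch below is purely illustrative. It is not how the deCONZ firmware works (I have no access to that), and the class and method names are made up.

```python
class RouteRecordCache:
    """Remembers the most recent relay list reported by each device."""

    def __init__(self):
        self._relays = {}

    def on_route_record(self, source, relays):
        # Relay 1 is the first hop after the originating device,
        # the last relay is the one that delivers to the concentrator.
        self._relays[source] = list(relays)

    def path_to(self, destination):
        """Hops from the concentrator to `destination`, or None if unknown."""
        relays = self._relays.get(destination)
        if relays is None:
            return None
        return list(reversed(relays)) + [destination]


NAMES = {0x731E: "Kerstverlichting", 0xA3F5: "HK zoutlamp", 0x9E0C: "Garage"}

cache = RouteRecordCache()
# Values from the Route Record shown above (Relay Count: 2).
cache.on_route_record(0x9E0C, [0x731E, 0xA3F5])

path = cache.path_to(0x9E0C)
print(" --> ".join(["DeConz"] + [NAMES[hop] for hop in path]))
```

For the packet above this gives DeConz --> HK zoutlamp --> Kerstverlichting --> Garage, which is the reverse of the path the Route Record travelled.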

Another observation... The two relay devices in the above packet are both Xiaomi Plugs. They are never queried for Link Quality. Why?
Could it be that DeConz is using the information in the Link Quality Response to build its route table and therefore ignores the quite valid routers? With good link quality to some of my unhappy IKEA lights...

Do you only see this behavior for Ikea devices, or is ignoring the optimal route a general behavior? I guess you have a decent number of Ikea devices?

I'm not sure, I'd need to check that. Good question.

I do have quite a large number of Ikea bulbs, yes.

In my mind, this does explain a large part of the behavior we have seen with the unreachable IKEA lights. Although it is just a hypothesis for now, to be verified.

Looking into several other routes, I see the same behavior.

I only see that the Xiaomi devices are not queried for Link Quality; all other brands are. I guess this is some kind of blacklisting..?

Example other routing behavior:

From DeConz, always:
DeConz --> WC Lamp --> Badkamer Ledstrip

Return route changes often after a many-to-one route request. Note that from a geographical point of view all of these make sense...
Badkamer Ledstrip --> WC Lamp --> DeConz
Badkamer Ledstrip --> Gang 1 --> DeConz
Badkamer Ledstrip --> Zoldertrap Lamp --> DeConz
Badkamer Ledstrip --> Gang 1 --> Zoldertrap Lamp --> DeConz

For the Garage light, I see from DeConz:
DeConz --> Zolder Noord Lamp --> Garage (very flaky route..!)

Return route I see:
Garage --> Kerstverlichting --> On/Off Light 36 --> DeConz
Garage --> Kerstverlichting --> HK Zoutlamp --> DeConz

Now when I look at the table in the Link Status reports, the routes above from Garage to DeConz make sense:
Garage --> Kerstverlichting --> On/Off Light 36 --> DeConz total cost: 4
Garage --> Kerstverlichting --> HK Zoutlamp --> DeConz total cost: 4

from DeConz to Garage does not:
DeConz --> Zolder Noord Lamp --> Garage total cost: 12 (based on incoming cost from DeConz to Zolder Noord Lamp)

@manup , could you elaborate on the route cost algorithm in use? Why does the above make sense?
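For clarity, this is the comparison I am making. In Zigbee the cost of a path is the sum of its link costs (each link cost is between 1 and 7), and a router is expected to prefer the path with the lowest total. In the sketch below only the totals (4, 4 and 12) and the cost of 7 between Zolder and Garage come from my Link Status reports. The split into individual links is a placeholder chosen to add up to those totals, and I am assuming the links of the return routes could be used in the forward direction at a similar cost.

```python
def path_cost(link_costs):
    """Zigbee path cost: the sum of the per-link costs along the path."""
    return sum(link_costs)


def cheapest_route(routes):
    """Name of the route with the lowest total cost."""
    return min(routes, key=lambda name: path_cost(routes[name]))


# Per-link costs are placeholders, except the 7 for Zolder <-> Garage.
ROUTES = {
    "via Zolder Noord Lamp": [5, 7],
    "via On/Off Light 36 and Kerstverlichting": [2, 1, 1],
    "via HK Zoutlamp and Kerstverlichting": [2, 1, 1],
}

for name, costs in ROUTES.items():
    print(f"{name}: total cost {path_cost(costs)}")
print("cheapest:", cheapest_route(ROUTES))
```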

Some more analysis today...
In my previous posts, you have seen that my Garage light is routed through my Zolder light. Both IKEA bulbs. The radio link from Zolder to Garage is right on the edge of what it can reach, so fails often.

Today, although the Garage light responds to group commands, it does not respond to unicast commands. Actually sometimes it does and sometimes not. This is a behavior that should be familiar to those who've read/contributed to this thread.

I can find this in the sniffing logs. Sometimes the Zolder light is able to communicate with Garage and sometimes not. Any time Zolder light cannot communicate with Garage, it reports this:

ZigBee Network Layer Command, Dst: DeConz, Src: Zolder No
    Frame Control Field: 0x1a09, Frame Type: Command, Discover Route: Suppress, Security, Destination, Extended Source Command
    Destination: 0x0000[DeConz]
    Source: 0xd6b7[Zolder Noo]
    Radius: 30
    Sequence Number: 65
    Destination: dresden-_ff:ff:00:c4:9a (00:21:2e:ff:ff:00:c4:9a)
    Extended Source: SiliconL_ff:fe:c3:4a:3e (00:0b:57:ff:fe:c3:4a:3e)
    ZigBee Security Header
    Command Frame: Network Status
        Command Identifier: Network Status (0x03)
        Status Code: Non-tree Link Failure (0x02)
        Destination: 0x9e0c[Garage]

This packet should tell DeConz to start finding another route to reach Garage, but that does not happen. The next packet sent to Garage is again routed through Zolder. To me that is a bug that must be solved.
This next packet for Garage is received by the Zolder light, but that light does not even try to send it to the Garage. Maybe this is a behavior of the IKEA firmware which is not good, but the root cause of the issue is the refusal of DeConz to find an alternative route.
I think that if a route is not available for a prolonged period of time, maybe the Garage light is starved for ACKs at a higher level than the 802.15.4 protocol and that may cause the firmware to disconnect or even crash. And I agree it should not, but the root cause issue is DeConz refusing to find a new route to the Garage light.
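For those who want to check the dissection by hand, here is a minimal sketch that decodes the NWK frame control field (0x1a09) according to the bit layout in the Zigbee specification, and parses the Network Status command. The payload bytes are reconstructed from the dissected fields above (command id 0x03, status 0x02, destination 0x9e0c in little-endian order), not copied from the raw capture, and the status table only lists a few codes.

```python
FRAME_TYPES = {0: "Data", 1: "Command"}
DISCOVER_ROUTE = {0: "Suppress", 1: "Enable"}
NETWORK_STATUS = {
    0x00: "No route available",
    0x01: "Tree link failure",
    0x02: "Non-tree link failure",
}


def decode_nwk_frame_control(fcf):
    """Decode the 16-bit Zigbee NWK frame control field."""
    return {
        "frame_type": FRAME_TYPES.get(fcf & 0x3, "Reserved"),
        "protocol_version": (fcf >> 2) & 0xF,
        "discover_route": DISCOVER_ROUTE.get((fcf >> 6) & 0x3, "Reserved"),
        "multicast": bool(fcf & 0x0100),
        "security": bool(fcf & 0x0200),
        "source_route": bool(fcf & 0x0400),
        "extended_destination": bool(fcf & 0x0800),
        "extended_source": bool(fcf & 0x1000),
    }


def parse_network_status(payload):
    """Parse a Network Status command: command id, status code, 16-bit destination."""
    if len(payload) < 4 or payload[0] != 0x03:
        raise ValueError("not a Network Status command")
    status = payload[1]
    destination = int.from_bytes(payload[2:4], "little")
    return NETWORK_STATUS.get(status, f"status 0x{status:02x}"), destination


fields = decode_nwk_frame_control(0x1A09)
status_name, destination = parse_network_status(bytes.fromhex("03020c9e"))

print(fields)
print(f"{status_name}, destination 0x{destination:04x}")
```

The decoded flags match what Wireshark prints for this frame: a Command frame with route discovery suppressed, security enabled, and both extended addresses present.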

Today I did an experiment to get DeConz to find another route to the Garage light: I disconnected mains from the Zolder light and looked at the sniffing logs. After a few tries, DeConz realizes that Zolder is gone and goes ahead and finds an alternative route to Garage. Next I reconnected Zolder, and after it announced its presence, a new route was found for Zolder as well. DeConz does not (yet) revert to routing Garage through Zolder.

Funny thing is that in the new situation, DeConz now talks directly with the Garage light, no routers in between.
Zolder is now reached through a route via 2 other routers (although it was obviously reachable directly by DeConz), so it looks to me like some table (neighbor table?) is full inside DeConz routing firmware.

Maybe this is related to its refusal to create a new route in response to a failing route..?

@manup , I would appreciate any comment from you on the above posts. Or at least let me know how to contribute to a solution (aside from looking for the root cause).

I would like to help creating a solution for these issues, since they bother me. If you'd give access to the firmware source code I can contribute directly (even if it is not open source). I don't mind helping Dresden Elektronik in that way :)

Excellent work done! 💪🏼
Hope we can get attention from the developers to get this fixed. It looks like you have a good setup and process, so a fix could be verified quite “easily”.
I now understand the behavior I’ve seen, which I’ve referred to as “self healing”, and why some bulbs suddenly start working again.

Nice work @djwlindenaar, I hope @manup can look at this.

@manup any comments ?

Thank you for this thread and the work. I am running 20 IKEA bulbs myself, about one year old and with updated firmware. I have experienced freezes and lost bulbs from the beginning, but not too frequently. It got worse with later deCONZ updates. I checked my network and optimized it for a crowded 2.4 GHz environment. Bulbs still drop out daily, which makes buttons and sensors useless, because they are still routed via these bulbs and so no longer deliver any data (movement, for example). I need to power cycle the bulbs to make the sensors available again, as the bulbs then start routing for them again. Hopefully this gets more attention. It is frustrating.

Another observation... The two relay devices in the above packet are both Xiaomi Plugs. They are never queried for Link Quality. Why?
Could it be that DeConz is using the information in the Link Quality Response to build its route table and therefore ignores the quite valid routers? With good link quality to some of my unhappy IKEA lights...

The query of the neighbor table (aka ZDP Mgmt_Lqi_req) was disabled for Xiaomi devices since they tend to stop responding to these queries after a while, and I suspected that an error in the firmware might trigger some invalid state or worse.

Thus the main driver for limiting certain requests is to prevent errors in end device firmware and to closely mimic the related vendor gateway behavior. Some of the investigation results can be found in https://github.com/dresden-elektronik/deconz-rest-plugin/wiki/End-device-Polling

As a side note the neighbor table queries are only used to build the visual mesh presentation and don't affect the network operation or routing tables.

So I finally got my sniffing hardware and I've also hacked Wireshark a little to show names for all the ieee802.15.4 addresses (which makes it a lot easier to read what's going on).

Anyway, my knowledge of the zigbee stuff is not very strong, so possibly I'm asking stupid questions. However I've been going through some sniffing logs and hopefully we can see something that explains the flaky behavior of the ikea lights.

See the log below. Does that seem like 'normal' behavior? Both lights involved are Ikea lights. First DeConz sends a request to 'Garage', routing through 'Zolder'. Then 'Zolder' tries to transmit it to 'Garage', but that seems to fail (why?) and it is retried 20 times in very quick succession (most are within 3 ms). Finally 'Zolder' gives up and sends a link failure back to DeConz.

Should it be this aggressive in retrying transmission?

Also: I notice that DeConz happily keeps sending requests to Garage through Zolder, although the link is clearly failing. DeConz even does this when Garage is not in the Link Status report of Zolder anymore. Also if sometimes Garage is in the Link Status report of Zolder, it is at cost 7 both incoming and outgoing. Actually the route from Garage to DeConz is not through Zolder...
Why does DeConz keep routing through Zolder even after a Link Failure message?

No.          Time         Source       Transmit Dev Receive Dev  Destination  Protocol     Info         
7580         748.356336   DeConz       DeConz       Zolder       Garage       ZigBee ZDP   Link Quality Request
7582         748.360218   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7583         748.363417   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7584         748.368857   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7585         748.373337   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7587         748.419739   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7588         748.424219   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7589         748.429019   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7590         748.434460   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7591         748.481181   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7592         748.485021   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7593         748.488541   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7594         748.492702   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7595         748.541983   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7596         748.545824   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7597         748.550944   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7598         748.556384   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7599         748.594786   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7600         748.597986   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7601         748.602787   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7602         748.608226   DeConz       Zolder       Garage       Garage       ZigBee ZDP   Link Quality Request
7603         748.640867   Zolder       Zolder       DeConz       DeConz       ZigBee       Network Status, 0x9e0c: Non-tree Link Failure
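As a rough check of the timing, the Zolder to Garage transmit times from the capture above can be grouped into bursts. This is only a sketch: the 20 ms burst threshold is an arbitrary assumption, and reading each burst as one transmission plus its MAC-level retries (IEEE 802.15.4 defaults to three retries, so four attempts) is an interpretation, not something the capture proves.

```python
# Capture times (seconds) of the 20 frames Zolder transmitted towards Garage,
# copied from the sniffer log above (packets 7582 to 7602).
times = [
    748.360218, 748.363417, 748.368857, 748.373337,
    748.419739, 748.424219, 748.429019, 748.434460,
    748.481181, 748.485021, 748.488541, 748.492702,
    748.541983, 748.545824, 748.550944, 748.556384,
    748.594786, 748.597986, 748.602787, 748.608226,
]

BURST_GAP_S = 0.020  # assumption: a gap above 20 ms starts a new burst


def split_bursts(timestamps, gap=BURST_GAP_S):
    """Group consecutive timestamps whose spacing stays below `gap`."""
    bursts = [[timestamps[0]]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > gap:
            bursts.append([])
        bursts[-1].append(cur)
    return bursts


bursts = split_bursts(times)
burst_sizes = [len(b) for b in bursts]
total_ms = (times[-1] - times[0]) * 1000

print(burst_sizes)
print(f"{total_ms:.0f} ms from first to last attempt")
```

For this capture the grouping gives five bursts of four frames each, spread over roughly 250 ms, so the 20 attempts look like five network-level tries rather than 20 independent ones.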

Good catch! Here I would suspect the deCONZ firmware should handle this error and retry to find a new route. I will check in the firmware whether this case is handled and how. I can imagine that the route is actually discarded, but if a new message is received through the same route, for example due to a ZCL attribute report from the light, the entry is just kept alive.

Routes can be established just by receiving any command from an incoming frame. But if the last hop doesn't keep the hop before that (which they usually do) the way back through the same hop won't work. Your findings might reflect exactly that case.

Thanks.

In this case, there is never a message from the Garage light received through Zolder. Therefore I wouldn't expect DeConz to keep using the route. See my other posts...

Status Code: Non-tree Link Failure (0x02)

Seems we have a winner: I've checked the Zigbee stack source (R21, ConBee II) and this status code is completely ignored. :-O I need to check the AVR Zigbee stack too, likely the same problem there.

As a fix I'd like to handle this code in the same way as the "No Route Available" status is handled, which is just to free the route entry and let the stack figure out a new one.

Do you also have a full packet of the command which was previously sent from the coordinator to the IKEA light? It would be interesting to see whether the discover route bit was set.

Here is the section from the Zigbee specification on what should happen in this particular case:

On receipt of a network status command frame by a router that is the intended destination of the command where the status code field of the command frame payload has a value of 0x01 or 0x02 indicating a link failure, the NWK layer will remove the routing table entry corresponding to the value of the destination address field of the command frame payload, if one exists, and inform the next higher layer of the failure using the NLME-NWK-STATUS.indication using the same status code.

So the fix above should work here for 0x02. 0x01 is also not handled, so I'll add that too.

0x00 No Route Available  (already handled)
0x01 Tree Link Failure
0x02 Non Tree Link Failure
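A minimal sketch of the handling described above, written in Python for illustration only. The actual fix is in the firmware's Zigbee stack; the `RoutingTable` class and its method names here are hypothetical, and only the status codes and the two addresses come from this thread.

```python
# Network Status codes listed above (Zigbee NWK command 0x03).
NO_ROUTE_AVAILABLE = 0x00
TREE_LINK_FAILURE = 0x01
NON_TREE_LINK_FAILURE = 0x02

# Status codes after which the route entry is freed, so that the stack
# has to discover a new route on the next transmission.
ROUTE_FAILURE_CODES = {NO_ROUTE_AVAILABLE, TREE_LINK_FAILURE, NON_TREE_LINK_FAILURE}


class RoutingTable:
    """Hypothetical routing table: destination address -> next hop address."""

    def __init__(self):
        self.next_hop = {}

    def on_network_status(self, status_code, destination):
        """Handle a received Network Status command.

        Returns True if a routing entry was removed.
        """
        if status_code not in ROUTE_FAILURE_CODES:
            return False
        return self.next_hop.pop(destination, None) is not None


table = RoutingTable()
table.next_hop[0x9E0C] = 0xD6B7  # Garage is reached via Zolder

removed = table.on_network_status(NON_TREE_LINK_FAILURE, 0x9E0C)
print(removed, table.next_hop)
```

Treating all three codes the same way keeps the sketch close to the proposal: the entry is simply dropped and route discovery is left to the stack.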

Great @manup! You mentioned the Conbee II. Will this fix also work for the Conbee I?

Yes, I've just checked the stack sources used by ConBee I and RaspBee. The same problem exists there. I'll cook new firmware versions for testing by Tuesday. It would be cool to get some feedback on whether that solves at least the IKEA routing issues.

Note that there were also other issues which won't be fixed by this, like IKEA bulbs going completely silent and not even listening to group casts.

Accidentally closed this issue... Reopened. Great that we are making progress on this.

I guess @djwlindenaar is the one who is able to give some feedback after the new firmware is available because the sniffing hardware is needed for this?

Do you also have a full packet of the command which was previously sent from the coordinator to the IKEA light? It would be interesting to see whether the discover route bit was set.

I'm not sure I understand your question. What packet are you looking for? And to which light (Garage or Zolder)?

Yes, I've just checked the stack sources used by ConBee I and RaspBee. The same problem exists there. I'll cook new firmware versions for testing by Tuesday. It would be cool to get some feedback on whether that solves at least the IKEA routing issues.

Note that there were also other issues which won't be fixed by this, like IKEA bulbs going completely silent and not even listening to group casts.

I'll test it for sure. Although it may take some time to get to a conclusion because sometimes there's no issue for a long time...

I was thinking that maybe there's something triggering that behavior of going completely silent. I've had this happening last week but didn't yet have time for going through the sniffer logs.

I'm on Raspbee btw.

Yes, I've just checked the stack sources used by ConBee I and RaspBee. The same problem exists there. I'll cook new firmware versions for testing by Tuesday. It would be cool to get some feedback on whether that solves at least the IKEA routing issues.

Let us know. I have this issue a lot and could test, although without doing any sniffing.

Note that there were also other issues which won't be fixed by this, like IKEA bulbs going completely silent and not even listening to group casts.

Are you referring to the issue where the software (deCONZ) reports and sets a state but the bulb itself does not respond or change its state? Maybe that issue would not be as severe if we nail down the routing issue.

Thanks for looking into this, highly appreciated!

Yes, I've just checked the stack sources used by ConBee I and RaspBee. The same problem exists there. I'll cook new firmware versions for testing by Tuesday. It would be cool to get some feedback on whether that solves at least the IKEA routing issues.

Note that there were also other issues which won't be fixed by this, like IKEA bulbs going completely silent and not even listening to group casts.

Thank you manup! I will upgrade to the new fw the second it is available and provide updates.

Could this issue also be linked to the problem with controlling the Ikea Fyrtur blinds?

Another one which seems odd to me. @manup , could this be an issue as well?

I see a lot of these kinds of sequences. I started looking into this because this is the last communication from houtlamp before it stops responding. DeConz configures 2 attribute reporting items and almost at the same time requests group memberships. In the DeConz logs this looks like below.

I'm wondering whether there's a bug in selecting the sequence number, or maybe it is allowed (I don't know) and this behavior is triggering a bug in the Tradfri firmware. There is one request with number 215 and two requests with number 216. Could it be that we have some kind of race condition where the handling of the requests by the Tradfri firmware is causing it to leak memory or otherwise hang up due to the two requests with the same number at almost the same time? Should DeConz have two requests out with the same sequence number?

13:09:40:905 Force read attributes for node HK houtlamp
13:09:40:905 binding for cluster 0x0000 of 0x000B57FFFEDBFE18 exists (verified by reporting)
13:09:40:905 binding for cluster 0x0006 of 0x000B57FFFEDBFE18 exists (verified by reporting)
13:09:40:905 configure reporting rq seq 215 for 0x000B57FFFEDBFE18, attribute 0x0006/0x0000
13:09:40:906 binding for cluster 0x0008 of 0x000B57FFFEDBFE18 exists (verified by reporting)
13:09:40:906 configure reporting rq seq 216 for 0x000B57FFFEDBFE18, attribute 0x0008/0x0000
13:09:40:906 Force binding of attribute reporting for node HK houtlamp
13:09:40:908 add task 7435258 type 21 to 0x000B57FFFEDBFE18 cluster 0x0004 req.id 233
13:09:41:013 Erase task req-id: 233, type: 21 zcl seqno: 216 send time 0, profileId: 0x0104, clusterId: 0x0004
13:09:41:053 ZCL configure reporting rsp seq: 215 0x000B57FFFEDBFE18 for cluster 0x0006 attr 0x0000 status 0x00
13:09:41:077 ZCL configure reporting rsp seq: 216 0x000B57FFFEDBFE18 for cluster 0x0008 attr 0x0000 status 0x00
13:09:41:116 verified group capacity: 255 and group count: 2 of LightNode 0x000b57fffedbfe18
13:09:41:116 0x000b57fffedbfe18 found group 0x0001
13:09:41:116 0x000b57fffedbfe18 found group 0xFFF0

Over the air this looks like: (it seems I'm not seeing every packet over the air, which is odd since the sniffer is right next to the raspbee. Maybe they get sent so quick they get lost in a buffer overflow or something on the sniffer)

No.          Source       Transmit Dev Receive Dev  Destination  Protocol     Info
2196482      DeConz       HK zoutlamp  HK houtlamp  HK houtlamp  ZigBee HA    ZCL: Configure Reporting, Seq: 215
2196484      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       Route Record, Dst: 0x0000
2196486      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       Route Record, Dst: 0x0000
2196488      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       Route Record, Dst: 0x0000
2196490      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       Route Record, Dst: 0x0000
2196492      DeConz       DeConz       HK zoutlamp  HK houtlamp  ZigBee HA    ZCL Groups: Get Group Membership, Seq: 216
2196494      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee HA    ZCL: Configure Reporting Response, Seq: 215
2196496      DeConz       HK zoutlamp  HK houtlamp  HK houtlamp  ZigBee HA    ZCL Groups: Get Group Membership, Seq: 216
2196497      DeConz       DeConz       HK zoutlamp  HK houtlamp  ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196498      DeConz       DeConz       HK zoutlamp  HK houtlamp  ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196500      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196502      DeConz       HK zoutlamp  HK houtlamp  HK houtlamp  ZigBee HA    ZCL Groups: Get Group Membership, Seq: 216
2196504      DeConz       HK zoutlamp  HK houtlamp  HK houtlamp  ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196506      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       Route Record, Dst: 0x0000
2196508      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee HA    ZCL: Configure Reporting Response, Seq: 216
2196510      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196511      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196512      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196513      DeConz       DeConz       HK zoutlamp  HK houtlamp  ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196515      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196517      DeConz       HK zoutlamp  HK houtlamp  HK houtlamp  ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196519      HK houtlamp  HK houtlamp  DeConz       DeConz       ZigBee HA    ZCL Groups: Get Group Membership Response, Seq: 216
2196521      DeConz       DeConz       HK zoutlamp  HK houtlamp  ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1
2196523      DeConz       HK zoutlamp  HK houtlamp  HK houtlamp  ZigBee       APS: Ack, Dst Endpt: 1, Src Endpt: 1

Indeed the sequence numbers should not be equal in this short time, I'll check the code, looks like an increment is missing.
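To illustrate what a missing increment means here, the following is a minimal sketch of an 8-bit ZCL transaction sequence counter in Python. It is not the deCONZ code; the `ZclSequence` class is a hypothetical name, and the only point is that every outgoing request takes a fresh number.

```python
class ZclSequence:
    """8-bit ZCL transaction sequence counter (wraps from 255 to 0)."""

    def __init__(self, start=0):
        self._value = start & 0xFF

    def next(self):
        """Return the current number and advance, so no two calls share one."""
        value = self._value
        self._value = (self._value + 1) & 0xFF
        return value


seq = ZclSequence(start=215)
configure_onoff = seq.next()  # Configure Reporting for cluster 0x0006
configure_level = seq.next()  # Configure Reporting for cluster 0x0008
get_groups = seq.next()       # Get Group Membership gets its own number

print(configure_onoff, configure_level, get_groups)
```

With a counter like this the three requests from the log above would go out as 215, 216 and 217 instead of 215, 216 and 216.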

Here is the first test firmware for ConBee II

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_R21_0x26520700.bin.GCF

Now porting this back into ConBee I /RaspBee firmware, will be online shortly.

And here is the test version for ConBee I and RaspBee with the route error fix.
Hope it brings some improvements to discover a new route when the error happens.

For testing I think it would be good if it just runs for a few days/weeks and we'll see if lights still stop responding to unicasts. In this case please try if the lights still react to group commands or go completely silent.

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_Rpi_0x26340500.bin.GCF

And when it reacts to group commands, wait a few days to verify it keeps reacting to group commands. In my case, when lights are reacting to group commands only, sooner or later they go completely silent.

I've seen this too in the past, perhaps they do this since no unicasts are received anymore, time will tell.

Maybe yes, at least the reachable state would turn to false too when the route is broken. But this may also be caused by other reasons.

I've noticed that there is a newer firmware for some IKEA bulbs. The change log states improvements in network/connectivity and failure management. I am updating 20 bulbs... I will provide updates if helpful.

https://ww8.ikea.com/ikeahomesmart/releasenotes/releasenotes.html

Pity it doesn't contain release dates...

It seems to be released, as I was able to download the firmware 2.3.007 described in the release notes.

Must have been around January https://www.iphone-ticker.de/ikea-home-smart-homekit-und-bridge-update-verfuegbar-152574/

Not a programmer here. I guess I should not install R21 on the ConBee II and wait? Is this a beta firmware?

Slightly off topic: also new firmware for the Trådfri dimmer, upgrading it to ZB3. Cross referencing #2485, we’ll have the same issue for the dimmer... I expect the old model motion sensor to be next...

And here is the test version for ConBee I and RaspBee with the route error fix.
Hope it brings some improvements to discover a new route when the error happens.

For testing I think it would be good if it just runs for a few days/weeks and we'll see if lights still stop responding to unicasts. In this case please try if the lights still react to group commands or go completely silent.

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_Rpi_0x26340500.bin.GCF

Thanks, @manup , just installed it. Let's see what it will bring.

Although I expect this will bring some improvement, how do you feel about DeConz firmware ignoring the recorded routes (due to the many-to-one routing)? In general, I've seen that the recorded routes are much more logical than the ones used by DeConz firmware.
Although I can imagine enabling source routing is a big job, DeConz firmware could (should?) still use the information from the route record packet to update the first hop to a device. Maybe just check whether the last hop to the coordinator matches what is stored in the routing table and if not, invalidate the entry in the routing table. I'm not sure if it's OK to replace the entry with the last hop from the recorded route, since the devices along the way probably ignore the information, but at least it could initiate a new route discovery for that node by DeConz firmware.

What do you think about this?
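The check proposed above could look roughly like the sketch below. It is purely illustrative Python, not firmware code: the function name and the dict-based routing table are hypothetical, the ordering of the relay list is an assumption stated in the comments, and the addresses are the ones that appear in the captures elsewhere in this thread.

```python
def check_route_record(routing_table, source, relays):
    """Compare a received Route Record with the stored next hop.

    routing_table: dict mapping destination address -> next hop address
    source:        address of the device that sent the Route Record
    relays:        relay list from the Route Record
    Returns True if a stale entry was invalidated.
    """
    # Assumption: relays are listed in the order they forwarded the frame,
    # so the last entry is the coordinator's direct neighbor. With no
    # relays, the device itself is the neighbor.
    last_hop = relays[-1] if relays else source

    stored = routing_table.get(source)
    if stored is not None and stored != last_hop:
        # Do not adopt the recorded route, only drop the stale entry so a
        # new route discovery is triggered on the next transmission.
        del routing_table[source]
        return True
    return False


ZOLDER, KERSTVERLICHTING, GARAGE = 0xD6B7, 0x731E, 0x9E0C
routes = {GARAGE: ZOLDER}
invalidated = check_route_record(routes, GARAGE, [KERSTVERLICHTING])
print(invalidated, routes)
```

Only invalidating the entry, rather than replacing it with the recorded hop, follows the caution in the post: intermediate devices may not honor the recorded route.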

I have installed the firmware on my raspbee, (i have 73 nodes mostly ikea), will report back findings.

Indeed the sequence numbers should not be equal in this short time, I'll check the code, looks like an increment is missing.

@manup , Maybe to help you pinpoint the issue, I saw that problem again today. It seems to be related to 2 configure reporting actions very close together. The second seq: 183 in this case is a different one from the one I reported before.

No.             Source          Transmit Dev    Receive Dev     Destination     Protocol        Info
39174           DeConz          DeConz          Gang 1          Gang 1          ZigBee HA       ZCL: Configure Reporting, Seq: 182
39180           DeConz          DeConz          Gang 1          Gang 1          ZigBee HA       ZCL: Configure Reporting, Seq: 183
39227           DeConz          DeConz          On/Off light 36 Badkamer ledstr ZigBee HA       ZCL: Read Attributes, Seq: 183
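Reuse like this can be spotted automatically in an exported capture. The sketch below is a hypothetical helper in Python; the tuples are typed in by hand from the three lines above, and a real capture would need a time window as well, since the 8-bit sequence number wraps around.

```python
from collections import defaultdict

# (packet number, destination, command, ZCL sequence number) from the capture above.
frames = [
    (39174, "Gang 1", "Configure Reporting", 182),
    (39180, "Gang 1", "Configure Reporting", 183),
    (39227, "Badkamer ledstr", "Read Attributes", 183),
]


def find_reused_sequences(frames):
    """Return {seq: [packet numbers]} for numbers used by more than one distinct request."""
    requests_by_seq = defaultdict(dict)
    for packet, destination, command, seq in frames:
        # Retransmissions of the same request share destination and command,
        # so they collapse into one key and are not reported.
        requests_by_seq[seq].setdefault((destination, command), packet)
    return {
        seq: sorted(requests.values())
        for seq, requests in requests_by_seq.items()
        if len(requests) > 1
    }


reused = find_reused_sequences(frames)
print(reused)
```

For the three frames above this reports sequence number 183 with packets 39180 and 39227.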

The sequence number issue should be fixed in 2.05.74, which is building now and will be available in a few hours.

https://github.com/dresden-elektronik/deconz-rest-plugin/commit/33d8a8b349c9f4967e8b94ed2657e038406317c8

And here is the test version for ConBee I and RaspBee with the route error fix.
Hope it brings some improvements to discover a new route when the error happens.
For testing I think it would be good if it just runs for a few days/weeks and we'll see if lights still stop responding to unicasts. In this case please try if the lights still react to group commands or go completely silent.
http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_Rpi_0x26340500.bin.GCF

Thanks, @manup , just installed it. Let's see what it will bring.

Although I expect this will bring some improvement, how do you feel about DeConz firmware ignoring the recorded routes (due to the many-to-one routing)? In general, I've seen that the recorded routes are much more logical than the ones used by DeConz firmware.
Although I can imagine enabling source routing is a big job, DeConz firmware could (should?) still use the information from the route record packet to update the first hop to a device. Maybe just check whether the last hop to the coordinator matches what is stored in the routing table and if not, invalidate the entry in the routing table. I'm not sure if it's OK to replace the entry with the last hop from the recorded route, since the devices along the way probably ignore the information, but at least it could initiate a new route discovery for that node by DeConz firmware.

What do you think about this?

Hmm, strange. The firmware should already use these Route Records to establish new routes. I need to check the code here, but if my memory serves me right this was enabled a while ago.

Well I got this stuff going on with Zolder and Garage (from a number of posts back) and I got this behavior:

No.             Source          Transmit Dev    Receive Dev     Destination     Protocol        Info
163778          Garage          Kerstverlicht   DeConz          DeConz          ZigBee          Route Record, Dst: 0x0000
163779                                                                          IEEE 802.15.4   Ack

The route record showed:

Command Frame: Route Record
    Command Identifier: Route Record (0x05)
    Relay Count: 1
    Relay Device 1: 0x731e[Kerstverli]

Next communication from DeConz to Garage does:

No.             Source          Transmit Dev    Receive Dev     Destination     Protocol        Info
163788          DeConz          DeConz          Zolder          Garage          ZigBee          APS: Ack, Dst Endpt: 0, Src Endpt: 0

So it's clearly not taking over the route recordings.

What I noticed at this time was that, as soon as I unpowered Zolder, DeConz immediately was capable of finding a new route. So maybe some of the routing related tables are full inside the DeConz firmware. Could this be true? How big are the neighbor/routing tables in the firmware? Could a full table be blocking the adoption of new route recordings?

Success! I can confirm that the behavior is now that after a non-tree link failure, route discovery starts indeed.

No.             Source          Transmit Dev    Receive Dev     Destination     Protocol        Info
173142          DeConz          Buiten - R sch  Tuin rechtsach1 Tuin rechtsach1 ZigBee ZDP      Bind Request, Basic (Cluster ID: 0x0000) Src: SiliconL_ff:fe:16:47:5f, Dst: dresden-_ff:ff:00:c4:9a
173143          Buiten - R sch  Buiten - R sch  DeConz          DeConz          ZigBee          Network Status, 0x35b7: Non-tree Link Failure
173144                                                                          IEEE 802.15.4   Ack
173206          DeConz          DeConz          Broadcast       Broadcast       ZigBee          Route Request, Dst: 0x35b7, Src: 0x0000

The sequence number issue should be fixed in 2.05.74, which is building now and will be available in a few hours.

33d8a8b

"swversion": "2.5.74",

:smile: Thanks @manup great response time :1st_place_medal:

Please wait for the update to 2.05.74, or use http://phoscon.de/app to show the Phoscon App. We've just seen that a gateway offline page is shown on some pages where it shouldn't, rebuilding now ~ 2 hours.

Indeed the sequence numbers should not be equal in this short time, I'll check the code, looks like an increment is missing.

Here is the first test firmware for ConBee II

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_R21_0x26520700.bin.GCF

Now porting this back into ConBee I /RaspBee firmware, will be online shortly.

This FW doesn't work for me. It shows all my devices as joined, but no mesh is built. Neither the connections are shown in the GUI, nor did any switching from Phoscon work (2.05.74). Flashed the Conbee II back to 264A0700 and everything works again as expected...

Hmm, strange. How long did it run?
I'm running this firmware on multiple setups now without issues so far.

A few minutes. I've seen some activities on some devices (blue points) but no links building. And switching didn't work at all. Should I test for a longer time?

It may take a while until the mesh gets built, but you should see lines after 5–10 minutes.

This FW doesn't work for me.

Same here: Raspberry Pi 4B, deCONZ 2.05.74, ConBee II. Small test network with Trådfri repeater, Trådfri plug, XBee, Trådfri dimmer, 2 Trådfri motion sensors (old and new), and Trådfri On/Off switch. The gateway doesn't seem to send or receive any traffic from any device. Seeing USB disconnects in the syslog. Everything is hunky dory again after flashing back to 4A.
log.gz

Just flashed again to test:

  • no lines in GUI (let it run for 15 minutes)
  • USB connection seems stable
  • UBISYS C4 and D1 switches still work
  • Tradfri Buttons don't

@ebaauw the log shows that the coordinator sees only two neighbors with this version.

Perhaps it's the network size. I currently have only 40 devices powered here. This firmware version has two other changes which might be the cause: an increased message buffer size (a bit over the top as far as I can tell, I will reduce that again) and a slightly lower TX power setting.

I'll prep another version with these settings removed to see if that works better.

Do you use a USB extension cable, or is the ConBee II connected directly?

For me it works fine on Raspbee...

On the RaspBee and ConBee 1 only the Route fix is included in the new version (different firmware).

the log shows that the coordinator sees only two neighbors with this version.

Yes, two of the end devices showed as neighbours in the GUI. Kinda strange, because they showed as connected to another router after downgrading the ConBee II firmware.

Do you use a USB extension cable, or is the ConBee II connected directly?

It's been connected directly ever since I got it. To a USB-2 port on the Pi, with the XBee connected to the other USB-2 port; nothing on USB-3. The connection has been stable, except when I configured the ConBee II as router (see #2463).

On the RaspBee and ConBee 1 only the Route fix is included in the new version (different firmware).

Haven't tried that yet. Will have to wait till later...

I see quite a lot of configure reporting packets. After a bit of searching it seems that these are sent periodically about every half hour. What is the reason these configure reporting requests are repeated periodically?
@ebaauw , do I remember correctly that you did a lot of these investigations into behavior of the IKEA coordinator? Does it do that too?

I noticed in de_web_plugin_private.h

#define IDLE_ATTR_REPORT_BIND_LIMIT 1800

I found another bit of evidence regarding the routing behavior of DeConz ignoring the many-to-one route records. It is causing some routers to have to figure out routing in the 'old-fashioned' way.
(note I have some devilish interference in packet 177130 :wink:)

The route for Badkamer ledstrip, according to DeConz, starts with the On/Off light 36, which obviously doesn't know how to get there. So it starts a route discovery and finally finds it can (barely, I think) reach the ledstrip itself. The return path is through Gang 1.

It looks like On/Off light 36 forgets about Badkamer ledstrip every time a many-to-one route request is processed...

No.               Source            Transmit Dev      Receive Dev       Destination       Protocol          Info
177114            DeConz            DeConz            On/Off light 36   Badkamer ledstrip ZigBee HA         ZCL: Read Attributes, Seq: 55
177115                                                                                    IEEE 802.15.4     Ack
177116            On/Off light 36   On/Off light 36   Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177117            On/Off light 36   HK plafond ledstrip Broadcast       Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177118            On/Off light 36   HK zoutlamp       Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177119            On/Off light 36   Zoldertrap Lamp   Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177120            On/Off light 36   Kerstverlichting  Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177121            On/Off light 36   HK stalamp        Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177122            On/Off light 36   DeConz            Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177123            On/Off light 36   HK houtlamp 2     Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177124            On/Off light 36   Keuken links      Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177125            On/Off light 36   Keuken Rechts     Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177126            On/Off light 36   HK houtlamp       Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177127            On/Off light 36   Keuken mid        Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177128            On/Off light 36   Tuin linksvoor 2  Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177129            On/Off light 36   WC lamp           Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177130            0x6666            0x6666            0x6666            0x6666            IEEE 802.15.4     Data, Dst: 0x6666, Src: 0x6666, Bad FCS
177131            On/Off light 36   Gang 1            Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177132            On/Off light 36   Hal lamp          Broadcast         Broadcast         ZigBee            Route Request, Dst: 0xc520, Src: 0xde81
177133            DeConz            On/Off light 36   Badkamer ledstrip Badkamer ledstrip ZigBee HA         ZCL: Read Attributes, Seq: 55
177134                                                                                    IEEE 802.15.4     Ack
177135            Badkamer ledstrip Badkamer ledstrip Gang 1            DeConz            ZigBee            Command, Dst: DeConz, Src: Badkamer , Bad FCS
177136                                                                                    IEEE 802.15.4     Ack
177137            On/Off light 36   Voordeur          Broadcast         0xfcfd            ZigBee            Command, Dst: 0xfcfd, Src: On/Off li, Bad FCS
177138            Badkamer ledstrip Gang 1            DeConz            DeConz            ZigBee            Route Record, Dst: 0x0000

next communication:

No.               Source            Transmit Dev      Receive Dev       Destination       Protocol          Info
177366            DeConz            DeConz            On/Off light 36   Badkamer ledstrip ZigBee HA         ZCL: Read Attributes, Seq: 58

Another try on the ConBee II firmware:

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_ConBeeII_0x26530700.bin.GCF

This version has the TX power settings of 0x264a0700. The buffers are still large (but according to the map they should fit nicely in the RAM) to check only one thing at a time.

I got this error, every time I try to update to latest firmware. Any hints why?
rich710@RichHassPc01:~$ sudo GCFFlasher_internal -d /dev/ttyACM1 -f deCONZ_ConBeeII_0x26530700.bin.GCF
GCFFlasher V3_13 (c) dresden elektronik ingenieurtechnik gmbh
Reboot device /dev/ttyACM1 (ConBee II)
deCONZ firmware version 26490700
R21B18 Bootloader
Vers: 2.05
build: Mar 22 2019
flashing 161378 bytes: |==============================|
verify: .
Flash update failed, invalid CRC. Please try again.
rich710@RichHassPc01:~$

Have you verified the MD5 sum of the downloaded file? (http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_ConBeeII_0x26530700.bin.GCF.md5)

Yes I have verified, and downloaded several times, even tried to revert to old firmware, and it went good, until now. Now I am stuck with this error, I have rebooted my NUC several times but still I'm stuck when flasher tries to reboot my ConbeeII.
rich710@RichHassPc01:~$ sudo GCFFlasher_internal -d /dev/ttyACM1 -f deCONZ_ConBeeII_0x26490700.bin.GCF
[sudo] password for rich710:
GCFFlasher V3_13 (c) dresden elektronik ingenieurtechnik gmbh
Reboot device /dev/ttyACM1 (ConBee II)

2139: Error: uart reset failed, check retry

I noticed you use ttyACM1. When you run

GCFFlasher -l

it will show the serial number.
You can use it to provide a more stable device name in the command.

GCFFlasher_internal -sn DE1948474 -f deCONZ_ConBeeII_0x26530700.bin.GCF

(replace with your device serial number)

Sorry, that didn't work. Maybe I should move it to my Windows machine and try there.
rich710@RichHassPc01:~$ GCFFlasher_internal -l
GCFFlasher V3_13 (c) dresden elektronik ingenieurtechnik gmbh
Path | Vendor | Product | Serial | Type
-----------------+--------+---------+------------+-------
/dev/ttyACM1 | 0x1CF1 | 0x0030 | DE1964163 | ConBee II
/dev/ttyUSB0 | 0x0403 | 0x6001 | A1YV35M2 | Generic FTDI
rich710@RichHassPc01:~$ GCFFlasher_internal -sn DE1964163 -f deCONZ_ConBeeII_0x26530700.bin.GCF
GCFFlasher V3_13 (c) dresden elektronik ingenieurtechnik gmbh
Reboot device (ConBee II)

2139: Error: uart reset failed, check retry
rich710@RichHassPc01:~$

Either that, or as an alternative you may try the following:

  • Unplug the ConBee II
  • GCFFlasher_internal -t 60 -sn DE1964163 -f deCONZ_ConBeeII_0x26530700.bin.GCF
  • Plug in the ConBee II again

The -t parameter lets GCFFlasher keep trying for one minute to process the update.

Updating the firmware worked on Windows, and when I plugged the stick into my NUC with Ubuntu again, it connected. BUT after one hour it has connected to only 4 of my nodes... :S
Annotation 2020-02-26 172624

Another try on the ConBee II firmware: http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_ConBeeII_0x26530700.bin.GCF

No more USB disconnects, but deCONZ seems to re-detect the ConBee II quite often. Still no traffic.
log.gz

EDIT To humour you, I disconnected the XBee and used 10cm USB extension cable to connect the ConBee II: no change.

Another try on the ConBee II firmware:

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_ConBeeII_0x26530700.bin.GCF

This version has the TX power settings of 0x264a0700. The buffers are still large (but according to the map they should fit nicely in the RAM) to check only one thing at a time.

Tested your new version, but after 10 minutes there's still no link shown... With the stable FW, all good (router) links in the network are displayed after about 2 minutes. The IKEA push button works immediately, while with the test firmware it doesn't work at all.

Hmm spooky, thanks for testing this. Next I'll try to lower the buffer sizes as mentioned before.
Still I wonder why it is working on my setup. From the screenshot above I can only tell that I have more end-devices than routers.

@manup had the same issues after flashing the firmware, but pressing reboot device in deconz/restarting deconz made the links come back.

And here is the test version for ConBee I and RaspBee with the route error fix.

Seems to run fine on my Pi 3B+ test system. I will wait until this weekend before upgrading my production network.

@ebaauw , do I remember correctly that you did a lot of these investigations into behavior of the IKEA coordinator? Does it do that too?

I added support for most of the Trådfri devices, but I haven't been able to sniff any ZigBee traffic to/from the Trådfri hub. It only uses touchlink pairing, and I've yet to try and capture that to recover the network key. That is, assuming it's possible at all.

I have been running the new FW on the conbee 1 stick in my ~40 node setup since the release of the new firmware and it has been working great without any issues! (In the picture, the nodes not connected are either out of battery or not powered)
image
image

Thanks to everyone involved who troubleshot and updated the code. It is SO appreciated! <3

Not working for me. Almost no connections in the grid. Have waited about 20 min after restart.

Raspberry Pi 3 Model B Rev 1.2
Conbee II
version 2.05.74
version 26530700

77 nodes

3 connections to
1 TRÅDFRI on/off switch
1 Xiaomi multi sensor
1 Philips dimmer switch
I'm able to trigger events from the Philips dimmer switch.
The nodes that have connections are quite close to the Conbee stick. The stick is located in the garage. I have some bulbs in the garage that have no connections.

No connection to
A lot of ikea bulbs, both E27 and GU10
A lot of Xiaomi multi sensors
A lot of Ikea TRÅDFRI remote control
And other nodes.

Logs
feb 27 09:29:06 raspberrypi systemd[1]: Started deCONZ: ZigBee gateway -- GUI/REST API.
feb 27 09:29:06 raspberrypi deCONZ[7204]: libEGL warning: DRI2: failed to authenticate
feb 27 09:29:06 raspberrypi deCONZ[7204]: libpng warning: iCCP: known incorrect sRGB profile

When downgrading to 264A0700 it starts building the mesh immediately.

@ebaauw , do I remember correctly that you did a lot of these investigations into behavior of the IKEA coordinator? Does it do that too?

I added support for most of the Trådfri devices, but I haven't been able to sniff any ZigBee traffic to/from the Trådfri hub. It only uses touchlink pairing, and I've yet to try and capture that to recover the network key. That is, assuming it's possible at all.

Right, sorry. Should have checked git blame. The code which mentions the behavior of the IKEA gateway was actually introduced in 48d2c39a267b5c6d025577eed7530be27932aa2c by @manup ...

@manup , did you indeed identify that the IKEA gateway re-configures the attribute reporting this often? Why would re-configuring be required; does the light need to be reminded regularly?

Upgraded Conbee I to beta FW and deCONZ .74

Mesh builds immediately and looks really nice!

Big thanks to @djwlindenaar . I'm extremely impressed that you came out of nowhere and found such severe bugs. And thanks to @manup as well, of course, for fixing them...

Conbee I and .74, too. Upgraded Ikea FW to 2.3.007 (some others 2.x).
Major improvement! No dropouts yet!

A big thank you to everybody contributing, developing, debugging, testing in this thread and beyond.

Something I found out in the process:
Normal group ON/OFF works quickly, but after recalling a scene (once the fade-in time has passed) and turning the group off again within less than 10 seconds, the lights only go out in the GUI (regardless of old or new GUI). Then some of them switch back to on in the GUI, while ALL physical bulbs stay on. After pressing OFF a second or third time, they sometimes go dark.

It's not uncommon to have to restore/rebuild scenes after a firmware upgrade.
But regardless of whether I build a new scene or try to fix an older one, the behaviour stays the same. Normal on/off is not affected.

I found out that it works as usual when I wait a little longer, say 15-20 seconds. Then the lights go out as usual. I assume deconz gets confused, like it does when you turn off a light while it is still fading in.

It seems the delay has increased until every light state is reported back and the scene recall is completed (or reported as successful) in deconz, so that other actions in that group work as desired. This seems to take a little longer now. Within this period - you shall not (pass ;) ) - switch lights off. This is not critical.

I tested a little further. We have a change of behavior. Previously, when bri was changed with a given speed, or a scene had a longer transition time, it was possible to interrupt with a new command. Even though the previous action was still being performed, the next command was queued. Now it's lost. E.g. my floor lights are operated by a motion sensor. Before they go off, I dim them and wait 10 seconds before turning them off. When I trigger motion and the scene is recalled while they are fading, the command is lost. Before that it was queued and executed afterwards. Is this due to .74 or the new firmware? Thanks

My Conbee II did not build mesh on the beta version. Only thing I can think of that could be different from "normal" settings is setting the channel to 25. Going back to 264a0700 fixed it

@realwax, I think you're hitting a Trådfri firmware bug, see #2068. You can easily verify this by issuing a command with a longer transitiontime in the GUI and then issuing a second command.

Before that it was queued and executed afterwards.

Did you update the light firmware? Are you sure you're using the same light? The REST API plugin doesn't keep track of the transitiontime once it has sent a command to the light.

@ebaauw Thank you for pointing me in the right direction. As IKEA stated in their release log that network and stability improvements were included, I upgraded my TRÅDFRI Lampe E14 WS opal 400lm to firmware 2.3.007 (latest). I thought it might ease some of these issues. I wonder how IKEA does the trick on their own bridge because this is a big usability issue. Now I have to go for quicker transition times, as you stated in the other issue. I have experienced this since day 1, but it got worse. Now it affects not only scene recall fades but any change with a longer transition, and that's new... What can I do? Flashing back to older fw is nearly a no-go, because it's a struggle with deconz. Thank you Erik.

I wonder how IKEA does the trick on their own bridge because this is a big usability issue.

The Trådfri hub is far less active than the deCONZ gateway or Hue bridge. The Trådfri controllers (switches, dimmers, but also their motion sensors) control the lights directly. The Trådfri lights send push notifications to the hub with state changes. The hub is only needed/used for their app. It might run some alarms, but no rules to handle button events or anything fancy.

I'm not sure if the Trådfri app even supports specifying longer transition times, but the controllers I've seen don't. You probably won't be able to send commands from the controllers faster than the default transition time of the lights. And if you do on occasion, you probably think you didn't fully press the button.

@ebaauw I see. What still puzzles me is the fact that I can switch a group of E14 on and off (providing some kind of light disco ;) ) within 1 second (on-off-on) and it works! But as soon as I hit a scene recall button in deconz I can't, and the behaviour gets weird. This is why I somehow doubt it's IKEA's fault. Why can I switch on and off really quickly, but as soon as I recall a scene I am stuck for at least 5-10 seconds?

Precisely measured times (20 tries):
group of 8 E14
group ON -> group OFF - switch times 1 second
recall scene -> switch on -> OFF not responding until 10-12 seconds!

I understand that with a recall more traffic is generated and every bulb gets more messages, but a difference of 10 seconds? Even when I change bri and ct for a whole group I can turn it off in a second but again a recall is like a freeze. I think this seems to be a deconz issue, isn't it?

I can record a video of it. ;) Maybe I am lacking knowledge here; if so, please excuse my lack of understanding, but I am somewhat frustrated because it's a struggle...

I'm afraid I'm still a bit confused about scenes. Like groups, they don't exist as an object in ZigBee, they're just a number under which a device (light) stores a state. On a scene recall, the number is transmitted, and each device restores the state from its non-volatile memory.
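The "scene is just a number under which the light stores a state" idea can be sketched as a tiny in-memory model. This is only an illustration, not deCONZ or light firmware code: `SimulatedLight`, `LightState` and `SceneKey` are made-up names, and keying the store by group and scene number is an assumption based on how the ZCL Scenes cluster scopes scene IDs to a group.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Illustrative light state; real lights store more attributes.
struct LightState {
    bool on;
    std::uint8_t bri;
    std::uint16_t ct;
};

// Hypothetical model of a light that keeps scenes in its own memory.
class SimulatedLight {
public:
    using SceneKey = std::pair<std::uint16_t, std::uint8_t>; // (group, scene)

    void setState(const LightState &s) { current_ = s; }
    const LightState &state() const { return current_; }

    // "Store scene": remember whatever state the light has right now.
    void storeScene(std::uint16_t group, std::uint8_t scene)
    {
        scenes_[SceneKey{group, scene}] = current_;
    }

    // "Recall scene": only the numbers are transmitted; the light restores
    // its own stored copy. Returns false if nothing was stored.
    bool recallScene(std::uint16_t group, std::uint8_t scene)
    {
        const auto it = scenes_.find(SceneKey{group, scene});
        if (it == scenes_.end()) {
            return false;
        }
        current_ = it->second;
        return true;
    }

private:
    LightState current_{false, 0, 0};
    std::map<SceneKey, LightState> scenes_;
};
```

Because the gateway never sees the stored copy, anything it shows for a scene is its own bookkeeping, which is where the two can drift apart.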

The fuzzy part is that not all devices seem to store their state correctly when storing the scene: colour temperature is notorious in this respect. I've also seen funny stuff when storing the scene while the light is still in transition. In that case, the /scenes resource is out of sync with the state stored on the light, causing the API to reflect the resource state instead of the actual light state, which will only be updated the next time the light is polled. Typically the transition time is specified when recalling the scene, but it would seem it's in the stored state as well.

I've scripted the setup of my production network (using ph), so I can re-create it easily. I had a very hard time scripting the creation of scenes in a predictable way. I ended up setting the light states, sleeping for a couple of seconds, waiting for the states to settle, storing the scene, and again sleeping a couple of seconds, waiting for the state to be stored.
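The set / sleep / store / sleep sequence described above could look roughly like the following sketch. It is not the actual `ph` script: `SceneSetupHooks`, `createSceneWithSettling` and the 3 second default are assumptions for illustration, and the two callbacks stand in for whatever REST calls set the light states and store the scene.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical hooks; in a real script these would issue the REST requests.
struct SceneSetupHooks {
    std::function<void()> setLightStates;
    std::function<void()> storeScene;
    // Replaceable so the sequence can be exercised without real waiting.
    std::function<void(std::chrono::milliseconds)> sleepFor =
        [](std::chrono::milliseconds d) { std::this_thread::sleep_for(d); };
};

// Set the states, let them settle, store the scene, let the store settle.
// "A couple of seconds" is taken here as 3 s, which is an assumption.
void createSceneWithSettling(const SceneSetupHooks &hooks,
                             std::chrono::milliseconds settle = std::chrono::seconds(3))
{
    hooks.setLightStates();
    hooks.sleepFor(settle);   // wait for the light states to settle
    hooks.storeScene();
    hooks.sleepFor(settle);   // wait for the scene to be stored on the lights
}
```

The fixed delays are a blunt instrument; they only make it more likely that the light has finished its transition before the scene is stored.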

You might try and re-create your scene, but no guarantees...

I could reproduce the out-of-sync topic from day one. I was used to deconz and the need to reconfigure scenes from time to time. Now it's really worse. Currently I would need to move my scenes out of deconz to iobroker and rebuild them all there. But this is a lot of work. Hopefully this gets fixed. May I create an issue? Lights are not reliable anymore when using scenes... I will try to recreate the scene as well. I created a "test" scene yesterday in addition to the existing ones, but it showed no different behaviour.

Not sure if it is related but after upgrading my Conbee I to the beta-firmware and deCONZ to .74 it took one day and now deCONZ lost the connection to the Conbee. Never experienced this in years before...But I wanted to mention it...The log just stated retries to connect with it over and over again...Restart of deCONZ solved it... (using the docker container)

I just tried to recreate a group and all its scenes... So much goes wrong in the creation process. When using Phoscon and the "ALL" bulbs selection, the scene values do not get stored properly. Even though the configured lights show what you wanted, a recall does not. You have to control each light on its own, even when you want the same settings for all. In particular, the old GUI must be used to correct wrongly stored color temperatures. Until a scene "works" you have to go over it a couple of times, maybe switching lights on and off during the scene configuration so that it gets stored properly. I do not have many technical details here, but I want to encourage all of you in the thread to test scenes: creation, storing, recall, and switching off after a recall. I think this is messed up with IKEA bulbs and it is getting worse... Maybe it's IKEA's fault, but I doubt it, as everything else, including direct group control, works like a charm with deconz.

I will now roll back/downgrade the firmware to 1.x and see if it works better. I will report back.

OK. Back to the drawing board. Yesterday 2 IKEA lights took a leave of absence, only to come back after a power cycle. Not saying that the bugs that were fixed weren't bugs. They were. Just not the ones causing the issue.

I was looking more deeply into the attribute reporting thing. I was wondering whether the repetitive configuring of the attribute reporting every 30 minutes (1800s) is causing the hangups of the light firmware.
I noticed this bit of code which doesn't seem to handle rq.maxInterval == 0 explicitly. How to handle this case properly is a bit difficult, since rq.maxInterval == 0 means that the light will only report a change, so there's no 'good' timeout for that case... Maybe the current implementation is fine, although I wonder if a better idea may exist.

```cpp
bool DeRestPluginPrivate::sendConfigureReportingRequest(BindingTask &bt, const std::vector<ConfigureReportingRequest> &requests)
{
<hidden>
        if (val.clusterId == bt.binding.clusterId)
        {
            // value exists
            if (val.timestampLastReport.isValid() &&
                val.timestampLastReport.secsTo(now) < qMin((rq.maxInterval * 3), 1800))
            {
                DBG_Printf(DBG_INFO, "skip configure report for cluster: 0x%04X attr: 0x%04X of node 0x%016llX (seems to be active)\n",
                           bt.binding.clusterId, rq.attributeId, bt.restNode->address().ext());
            }
 ```
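
To make the `rq.maxInterval == 0` concern concrete, here is a minimal standalone sketch of the skip threshold. Only the expression `qMin((rq.maxInterval * 3), 1800)` comes from the quoted snippet; `skipThresholdSecs` and `shouldSkipConfigure` are hypothetical helper names, and `std::min` stands in for Qt's `qMin` so the example compiles without Qt.

```cpp
#include <algorithm>
#include <cstdint>

// Mirrors qMin((rq.maxInterval * 3), 1800) from the quoted snippet.
int skipThresholdSecs(std::uint16_t maxInterval)
{
    return std::min(static_cast<int>(maxInterval) * 3, 1800);
}

// True when the last report is recent enough that, by the quoted condition,
// the Configure Reporting request would be skipped.
bool shouldSkipConfigure(int secsSinceLastReport, std::uint16_t maxInterval)
{
    return secsSinceLastReport < skipThresholdSecs(maxInterval);
}
```

With `maxInterval == 0` the threshold is 0, so for any non-negative elapsed time the condition is false and this particular check never skips the request. A `maxInterval` of 300 gives a 900 second window, and anything from 600 upwards is capped at 1800 seconds.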

I did some experiments, including asking the IKEA lights to report `ONOFF` and `LEVEL` periodically instead of only when a change is made. The lights happily report their status periodically, so that may be an acceptable way to avoid the above issue. To be verified properly of course.

Anyway, while doing these experiments, I stumbled upon an actual bug. I noticed the magical Default Response Command being returned whenever the IKEA lights now report their attributes. So I looked into what that thing is. Apparently it is supposed to conclude a ZCL/APS transaction when requested. There's a bit in the ZCL packet which dictates whether or not it should be sent: `Disable Default Response`.
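
For readers unfamiliar with that bit, here is a minimal sketch of the check. It assumes the standard ZCL frame control layout, in which bit 4 (mask 0x10) is the Disable Default Response flag; `kDisableDefaultResponse` and `wantsDefaultResponse` are illustrative names, not deCONZ API.

```cpp
#include <cstdint>

// Bit 4 of the ZCL frame control byte (illustrative constant name).
constexpr std::uint8_t kDisableDefaultResponse = 0x10;

// True when the sender left the Disable Default Response bit cleared,
// i.e. it asks the receiver to answer with a Default Response.
constexpr bool wantsDefaultResponse(std::uint8_t frameControl)
{
    return (frameControl & kDisableDefaultResponse) == 0;
}
```

In the captures in this comment, "Disable Default Response: False" corresponds to `wantsDefaultResponse(...)` returning true.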

For attribute reports these are handled nicely by deCONZ:

No.     Time             Source  Transmit Dev  Receive Dev  Destination  Disable Default Response  Info
208134  10h 43m 23.151s  Gang 1  Gang 1        DeConz       DeConz       False                     ZCL: Report Attributes, Seq: 15
208136  10h 43m 23.158s  DeConz  DeConz        Gang 1       Gang 1                                 APS: Ack, Dst Endpt: 1, Src Endpt: 1
208138  10h 43m 23.212s  DeConz  DeConz        Gang 1       Gang 1       True                      ZCL: Default Response, Seq: 15


However, for the Configure Reporting Response command, deCONZ fails to send the Default Response. I'm not sure how the IKEA lights handle this situation, but it may cause a memory leak. Remembering that the Default Response is a kind of closure of the transaction, it may be that the firmware only releases a certain amount of memory after it is received.

No.     Time             Source  Transmit Dev  Receive Dev  Destination  Disable Default Response  Info
207941  10h 43m 8.422    DeConz  DeConz        Gang 1       Gang 1       True                      ZCL: Configure Reporting, Seq: 41
207949  10h 43m 8.481    Gang 1  Gang 1        DeConz       DeConz       False                     ZCL: Configure Reporting Response, Seq: 41
207951  10h 43m 8.485    Gang 1  Gang 1        DeConz       DeConz                                 APS: Ack, Dst Endpt: 1, Src Endpt: 1
207952  10h 43m 8.487    DeConz  DeConz        Gang 1       Gang 1                                 APS: Ack, Dst Endpt: 1, Src Endpt: 1
207954  10h 43m 8.493    Gang 1  Gang 1        DeConz       DeConz                                 APS: Ack, Dst Endpt: 1, Src Endpt: 1


I'm going to test this hypothesis with this patch in place:
```diff
diff --git a/bindings.cpp b/bindings.cpp
index 9607b09..0c2b5fc 100644
--- a/bindings.cpp
+++ b/bindings.cpp
@@ -443,6 +443,12 @@ void DeRestPluginPrivate::handleZclConfigureReportingResponseIndication(const de
         allNodes.push_back(&l);
     }

+    // send DefaultResponse if not disabled
+    if (!(zclFrame.frameControl() & deCONZ::ZclFCDisableDefaultResponse))
+    {
+        sendZclDefaultResponse(ind, zclFrame, deCONZ::ZclSuccessStatus);
+    }
+
     for (RestNodeBase * restNode : allNodes)
     {
         if (restNode->address().ext() != ind.srcAddress().ext())
```

I noticed this bit of code which doesn't seem to handle rq.maxInterval == 0 explicitly. How to handle this case properly is a bit difficult, since rq.maxInterval == 0 means that the light will only report a change, so there's no 'good' timeout for that case... Maybe the current implementation is fine, although I wonder if a better idea may exist.

Yeah, the REST API plugin checks that the attribute reporting is working. If not, it tries to reconfigure it and, I think, it also falls back to polling the device.

I did some experiments, including asking the IKEA lights to report ONOFF and LEVEL periodically instead of only when a change is made. The lights happily report their status periodically, so that may be an acceptable way to avoid the above issue. To be verified properly of course.

I think we ran with that setting for quite some time. It was changed, together with less frequent polling of the neighbour tables and no longer polling the state, because some of the Trådfri firmware would hang (requiring a power cycle of the light). I doubt that the reporting setting actually contributed to the firmware crash.

There's a bit in the ZCL packet which dictates whether or not it should be sent Disable Default Response.

I don't think sending the default response would do much harm when the Disable Default Response bit is set. Not sending the default response when the bit is not set will do harm, since the device waiting for the response might conclude the coordinator is no longer reachable and eventually leave the network.

@ebaauw , indeed. That's exactly the point. deCONZ fails to send a Default Response in reply to a Configure Reporting Response. It happens that IKEA lights are requesting a Default Response for Configure Reporting Response.
So, as you say, that may be the reason, coupled with the Configure Reporting Request every half hour, for IKEA lights to go awol.

Just a reminder:
Ikea bulbs (E27 & GU10 v1) occasionally become unreachable and need a power cycle when connected to a HUE bridge as well, so that particular issue is not unique to Conbee I/II
Out of 16 E27 and 12 GU10 on my HUE bridge, I would say one bulb 'hangs' per 1-2 weeks, roughly. Sometimes longer, sometimes quicker. This issue improved with the latest HUE firmware releases over the past year and a half.

@all Which Tradfri firmware version are you using?

Are you on 1.x or 2.x? With 2.x they introduced Zigbee 3.0. I upgraded to 2.x and the trouble began (E14 bulbs). I noticed improvements regarding network speed and connectivity. But two things made me roll back 20 bulbs (sigh). The "soft on" was not working anymore: bulbs turned on without a fade-in. And scenes were not working as they used to. To be precise, after recalling a scene, a wait time was needed before turning the lights off or recalling the next scene; otherwise deconz went out of sync and resynced to on while the bulb did nothing.

I would appreciate it if you shared your experience with the versions; deconz 'flawlessness' with 2.x does not seem to be a given.

Is there a recommendation? Which FW version? Besides the lost connection issue and the connection issues in IKEA's fw itself, it seems deconz can handle 1.x better.

Thanks!

I am using:

  • Conbee 1 with 0x26340500 firmware
  • Deconz version 2.05.73 (using marthoc docker container on Debian)
  • ~60 nodes, mostly IKEA Trådfri
  • Conbee 1 (coordinator) is NOT installed centrally in my 200 square meter house. The system relies heavily on roaming and meshing.
  • The Conbee 1 stick is installed on a 0.5 meter USB extension cable and mounted on the side of the shelf where the computer resides, to give the RF signals more free space.

Since the upgrade of the Conbee 1 stick's firmware (same day as released), deconz has been working like a charm! No issues whatsoever.

Upgraded my production network (RPi 3B+, stretch, RaspBee) to 2.05.74 and 0x26340500 yesterday. Seems stable, except for the issues below.

Not sure if related, but the route to my lumi.curtain curtain controller was lost this morning. Reports by the controller would still reach the gateway. The route wouldn't be restored on power cycling the controller. I had to open the network and power cycle the controller, before the coordinator would reach it again.

Also, one of my Eurotronic Spirit radiator valves was unreachable by deCONZ after starting deCONZ, while still sending reports. Power cycling the TRV brought it back, as usual.

I didn't have the opportunity to do any deep investigation this time, but both devices have been problematic from the start, exhibiting these symptoms every once in a while. Same for my IKEA Fyrtur blind. I'll continue to monitor the situation and report back if running into more issues.

@tubalainen
Which firmware are you using on your tradfri bulbs? As far as I have tested, this makes a difference with respect to the lost connection issue, even in this thread. My results are that the steps taken here improved the operation of tradfri bulbs running 1.x on Conbee 1. But a network of 20 E14 with tradfri fw 2.3.x is a mess: timings are off, scenes get stuck, deconz gets out of sync, lights start blinking (lost?), ... as reported above. I think this should be discussed so that a clear recommendation can be given on which IKEA fw to use for a good experience. Maybe there is a git article already. But from my experience: do not upgrade 😅

So from my point of view, after hours of testing and flashing: thank you for the improvement for IKEA fw 1.x! Is it possible to mitigate the current issues when 2.x is used? Otherwise it currently won't be possible to upgrade to Zigbee 3 with IKEA. It seems like behavior changed and operating them in deconz must be adapted. This is probably for @ebaauw to judge or deal with?

Cheers
Have a good Sunday 😊

@realwax
I have tried to FW upgrade all my entities to the latest FW. Here is my list of all (currently active) nodes.

Agreed on the mess with all "moving parts" at the same time trying to nail down the root cause of the problems.

image

@tubalainen

Thank you for your list.

Looking especially at the routers (bulbs), I see that you run 2.3.007 on your E14s. Can you reproduce my issue list? https://github.com/dresden-elektronik/deconz-rest-plugin/issues/2518
As I was unaware of the automatic fw upgrade to 2.3.x, my network was a mix of 1.x and 2.x, which was not very good. I upgraded by hand to 2.3.x and then it got worse (network faster, but a massive usability drain and blinking bulb dropouts). So if you experience blinking bulbs on your E14s or "lagginess" with scenes, I recommend downgrading them to 1.2... I would be interested in getting an "official" / professional statement from our pro devs here about this. I'm pretty sure there was a time when deconz had similar trouble with 1.2 and it was improved. I feel like this needs to be done for 2.3.x as well, or IKEA messed it up on their own. Hard to say, as I am not deep into the code.

@realwax
I am using the otau feature of deconz and its firmware update script to download the fw files.
I do not know why the E14s are not updated ... huhum.

How did you "manually" update/downgrade the bulbs?

Well well. It all works fine and the majority of the routing is done by the E27s with fw 2.3.x and the Jormen and FLOALT panel led drivers.

@tubalainen
Interesting that you don't experience those issues. Maybe it's because of the mesh size. I have 23 bulbs, 20 of them E14 on 2.3.007. I deactivated the automatic otau since the new firmware messed up my usability. Via the GUI you can downgrade with the update button. Choose the proper firmware first, press update, and maybe press it again. From status paused it goes to queued -> idle -> start firmware update (percentage). Sometimes it hangs. Sometimes it needs a reboot. Sometimes a power cycle is enough. Sometimes you need to bring the bulb closer to the coordinator.

It seems like behavior changed and operating them in deconz must be adapted. This is probably for @ebaauw to judge or deal with?

Not sure what you mean here. I handled some differences between ZLL and ZB3 firmware for the controllers (Trådfri remote and Trådfri wireless dimmer), see #2485. This is at APS level in the ZigBee stack, and handled by the REST API plugin.

The routing issues from this topic are at NWK level, which is handled by the device firmware. Like everyone else here, I don't have access to the firmware sources. Even if I did, there's nothing I could do, as I don't know the details of the NWK and MAC layers.

@ebaauw
https://github.com/dresden-elektronik/deconz-rest-plugin/issues/2518
To be precise: I included my findings with a mesh of 20 E14 on 2.3.007 in that issue. Some features are gone, and a scene recall pretty much messes everything up for 10 to 15 seconds in that group. I don't know if this is only firmware related or also deconz related. This is what I meant by a change to deal with. So for daily operation I see 2.3.007 as a no-go. A usability example: a simple switch that rotates through scenes, configured in Phoscon, does not work anymore unless it is pressed carefully, meaning you wait after each scene rotation. While in 1.x everything is quick and fine, with 2.3.x it gets stuck.

Thanks, @manup , just installed it. Let's see what it will bring.
Although I expect this will bring some improvement, how do you feel about the DeConz firmware ignoring the recorded routes (due to the many-to-one routing)? In general, I've seen that the recorded routes are much more logical than the ones used by the DeConz firmware.
Although I can imagine enabling source routing is a big job, DeConz firmware could (should?) still use the information from the route record packet to update the first hop to a device. Maybe just check whether the last hop to the coordinator matches what is stored in the routing table and if not, invalidate the entry in the routing table. I'm not sure if it's OK to replace the entry with the last hop from the recorded route, since the devices along the way probably ignore the information, but at least it could initiate a new route discovery for that node by DeConz firmware.
What do you think about this?
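
The "compare the recorded last hop with the routing table" idea could be sketched as follows. This is not the ConBee firmware, whose sources are not public: `RoutingTable` and `checkRouteRecord` are hypothetical, the table is reduced to a destination-to-next-hop map, and the sketch assumes the relay list is ordered from the originating device towards the coordinator, so its last entry is the coordinator's first hop (an empty list meaning a direct neighbour).

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using NwkAddr = std::uint16_t;

// Hypothetical, heavily simplified routing table: destination -> next hop.
struct RoutingTable {
    std::unordered_map<NwkAddr, NwkAddr> nextHop;
};

// Compares the route recorded by `source` with the stored next hop.
// Returns true when the stored entry was stale and has been invalidated,
// in which case the caller would start a new route discovery.
bool checkRouteRecord(RoutingTable &table, NwkAddr source,
                      const std::vector<NwkAddr> &relays)
{
    // Assumption: the last relay is the hop adjacent to the coordinator.
    const NwkAddr observedFirstHop = relays.empty() ? source : relays.back();

    const auto it = table.nextHop.find(source);
    if (it == table.nextHop.end() || it->second == observedFirstHop) {
        return false; // nothing stored, or stored route agrees with the record
    }
    table.nextHop.erase(it); // invalidate instead of overwriting the entry
    return true;
}
```

The entry is erased rather than replaced, in line with the concern above that routers along the recorded path may not honour it in the reverse direction.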

Hmm, strange. The firmware should already use these Route Records to establish new routes. I need to check the code here, but if my memory serves me right this was enabled a while ago.

@manup , did you find some time to look at this? This morning I found a situation where I power-cycled a light (IKEA) which was 4 hops away from the coordinator. Somehow a router in between (also IKEA) decided it didn't know the route to this light anymore. I actually see the light happily doing its job in routing for other lights, reporting link status and responding to Network Address Request from deCONZ. This last one is happening only because those are broadcast messages on the network..!
However, the router in between is silently dropping any unicast frames it should route to this light. This router should, of course not silently drop them, then again, deCONZ should be robust against this bad behavior.
In the meantime, the light does happily send route record messages to deCONZ as well, which arrive and are ignored by deCONZ.

I think there should be some logic which triggers deCONZ to reconsider its routes in this case, especially when it detects that the ZCL requests are not being replied to, which in the end leads to the light getting marked as zombie. The discovery that follows the marking as zombie actually does lead to a reply from the light. Maybe when a discovery for a zombie is started, it should also invalidate whatever route information is available. But better still would be to invalidate the route sooner, when a couple of ZCL requests are not replied to (probably already upon the first or second of those).

What do you think about this?
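
A minimal sketch of the "invalidate after a couple of unanswered requests" suggestion follows. `UnicastWatchdog` and its methods are hypothetical names and do not correspond to anything in deCONZ; the threshold is a parameter because the comment above only suggests "the first or second" missed reply.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical per-node counter of consecutive unanswered unicast requests.
class UnicastWatchdog {
public:
    explicit UnicastWatchdog(unsigned maxMissed) : maxMissed_(maxMissed) {}

    // Call when a unicast request timed out. Returns true once the limit is
    // reached, meaning the caller should invalidate the route to `nwk`.
    bool onTimeout(std::uint16_t nwk)
    {
        unsigned &missed = missed_[nwk];
        if (++missed < maxMissed_) {
            return false;
        }
        missed = 0; // start counting again after signalling
        return true;
    }

    // Call when any reply from `nwk` arrives; the route evidently works.
    void onReply(std::uint16_t nwk) { missed_.erase(nwk); }

private:
    unsigned maxMissed_;
    std::unordered_map<std::uint16_t, unsigned> missed_;
};
```

With a limit of 2, the second consecutive timeout for a node signals invalidation, well before the node would be marked as zombie.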

This new firmware did not solve my problem.

I use a Raspbee on a separate Pi and Home Assistant (running on a NUC) and have appx. 25 tradfri lights. Mostly GU10 used in groups of 3.

I had big problems with single lights in a light group getting unresponsive and needing a power cycle to come back again. This happened both with Ikea FW v1 and after upgrading the bulbs to 2.3.007.

The solution was to change my config from grouping the lights in HASS to defining light groups in Phoscon and referencing the Phoscon groups as single lights in HASS. After this change I've been running without problems for a couple of months.

I do want to be able to do the grouping in HASS so I upgraded my Raspbee to 0x26340500 and Deconz to 2.05.74 and changed my config back to using light groups in HASS. After running this for a week I've had bulbs going stale 3 or 4 times, and am now switching back to using Phoscon groups again.

I think there should be some logic which triggers deCONZ to reconsider its routes in this case, especially when it detects that the ZCL requests are not being replied to, which in the end leads to the light getting marked as zombie. The discovery that follows the marking as zombie actually does lead to a reply from the light. Maybe when a discovery for a zombie is started, it should also invalidate whatever route information is available. But better still would be to invalidate the route sooner, when a couple of ZCL requests are not replied to (probably already upon the first or second of those).

I'm still on the new firmware and test a few things including the route records, hope to get it online this week.

What do you think about this?

That's a good point, I need to check the core and REST-API plugin code here, since I'd think the firmware should degrade the route "quality" already when no APS level ACK is received. The APS ACK is a flag which is set optionally in the ZCL/APS requests and is often disabled to lower the network traffic. So a rough idea is that we should enable APS ACK if the plugin detects that unicast requests lead to timeouts.

Perhaps part of this is already in place, need to check the code.

However, the router in between is silently dropping any unicast frames it should route to this light. This router should, of course, not silently drop them; then again, deCONZ should be robust against this bad behavior.

The light should pick up a new route as soon as a new route discovery is triggered. So the goal should be to detect a broken route as fast as possible (hopefully APS ACK will do the trick) and trigger route discovery.

The state machine for that is already in place in deCONZ core (this is what leads to the NWK address request broadcast). This works in case of one-hop links and for lights which do pick up routes based on incoming commands (the reply to the broadcast). The broadcast is nice since it also respects changes to the NWK address, because the MAC address is included. As a next step, I will try to send a unicast with APS ACK enabled in case no reply is received.
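The escalation described above (plain retry, then APS ACK, then route discovery) could be sketched roughly as follows. This is not deCONZ code: the class, the returned action names and both thresholds are hypothetical and only illustrate the idea of reacting to consecutive unicast timeouts.

```python
from dataclasses import dataclass

# Hypothetical thresholds, not taken from deCONZ.
ENABLE_APS_ACK_AFTER = 1   # consecutive timeouts before requesting APS ACKs
REDISCOVER_AFTER = 3       # consecutive timeouts before triggering route discovery


@dataclass
class NodeLinkState:
    """Per-light bookkeeping for unicast requests (illustrative only)."""
    consecutive_timeouts: int = 0
    aps_ack_enabled: bool = False

    def on_reply(self) -> str:
        # Any reply shows that the current route works again.
        self.consecutive_timeouts = 0
        return "ok"

    def on_timeout(self) -> str:
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts >= REDISCOVER_AFTER:
            # Give up on the current route and start over.
            self.consecutive_timeouts = 0
            return "invalidate-route-and-rediscover"
        if self.consecutive_timeouts >= ENABLE_APS_ACK_AFTER:
            # Ask for an end-to-end acknowledgement on the next attempt.
            self.aps_ack_enabled = True
            return "retry-with-aps-ack"
        return "retry"
```

With these example thresholds, the first two timeouts lead to retries with APS ACK requested and the third one drops the route.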

Unfortunately, yesterday I've had to power cycle an IKEA E27 bulb (White 1000LM, v1 firmware) too. It only reacted to group, but not unicast commands. Seems the issue is not fixed yet :(

(and yes I'm on v74 and the beta firmware for RaspBee)

See the above comments, the next changes might help to recover routing.

What do you think about this?

That's a good point, I need to check the core and REST-API plugin code here, since I'd think the firmware should degrade the route "quality" already when no APS level ACK is received. The APS ACK is a flag which is set optionally in the ZCL/APS requests and is often disabled to lower the network traffic. So a rough idea is that we should enable APS ACK if the plugin detects that unicast requests lead to timeouts.

Perhaps part of this is already in place, need to check the code.

It looks like even when the APS ACK request bit is set, deCONZ does not do anything with the missing ack (only one retry and then nothing...)

BTW houtlamp 2 is the one dropping the packets directed at Tuin linksvoor 2

No. Time    Source  Transmit Dev    Receive Dev Destination Disable Default Response    Info
245915  10h 28m 42.108501s  DeConz  DeConz  HK houtlamp 2   Tuin linksvoor 2    True    ZCL: Read Attributes, Seq: 245
245922  10h 28m 46.033452s  DeConz  DeConz  HK houtlamp 2   Tuin linksvoor 2    True    ZCL: Read Attributes, Seq: 245
ZigBee Application Support Layer Data, Dst Endpt: 1, Src Endpt: 1
    Frame Control Field: Data (0x40)
        .... ..00 = Frame Type: Data (0x0)
        .... 00.. = Delivery Mode: Unicast (0x0)
        ..0. .... = Security: False
        .1.. .... = Acknowledgement Request: True
        0... .... = Extended Header: False
    Destination Endpoint: 1
    Cluster: On/Off (0x0006)
    Profile: Home Automation (0x0104)
    Source Endpoint: 1
    Counter: 107

However, the router in between is silently dropping any unicast frames it should route to this light. This router should, of course, not silently drop them; then again, deCONZ should be robust against this bad behavior.

The light should pick up a new route as soon as a new route discovery is triggered. So the goal should be to detect a broken route as fast as possible (hopefully APS ACK will do the trick) and trigger route discovery.

The state machine for that is already in place in deCONZ core (this is what leads to the NWK address request broadcast). This works in case of one-hop links and for lights which do pick up routes based on incoming commands (the reply to the broadcast). The broadcast is nice since it also respects changes to the NWK address, because the MAC address is included. As a next step, I will try to send a unicast with APS ACK enabled in case no reply is received.

Actually, the houtlamp 2 never sees a reply to the broadcast address request. The messages to deCONZ are routed through tuin linksvoor 3, so even if houtlamp 2 would pick up that route, it never gets the chance. This is then again caused by deCONZ not picking up the route record as a new route.

ZigBee Network Layer Command, Dst: DeConz, Src: Tuin link
    Frame Control Field: 0x1a09, Frame Type: Command, Discover Route: Suppress, Security, Destination, Extended Source Command
    Destination: 0x0000[DeConz]
    Source: 0x0ea5[Tuin linksvoor 2]
    Radius: 29
    Sequence Number: 51
    Destination: dresden-_ff:ff:00:c4:9a (00:21:2e:ff:ff:00:c4:9a)
    Extended Source: EnergyMi_ff:fe:e9:91:86 (d0:cf:5e:ff:fe:e9:91:86)
    ZigBee Security Header
    Command Frame: Route Record
        Command Identifier: Route Record (0x05)
        Relay Count: 1
        Relay Device 1: 0xc9fa[Tuin linksvoor 3]

Now Tuin linksvoor 2 sends a reply to the network address request broadcast and succeeds, but the APS ACK from deCONZ never reaches Tuin linksvoor 2, since it is dropped by houtlamp 2. So it resends a couple of times before giving up. That will have a good chance of messing up Tuin linksvoor 2.

No. Time    Source  Transmit Dev    Receive Dev Destination Disable Default Response    Info
246199  10h 29m  1.496384s  Tuin linksvoor 2    Tuin linksvoor 3    DeConz  DeConz      Network Address Response, Status: Success, Address: EnergyMi_ff:fe:e9:91:86 = 0x0ea5
246201  10h 29m  1.502056s  DeConz  DeConz  HK houtlamp 2   Tuin linksvoor 2        APS: Ack, Dst Endpt: 0, Src Endpt: 0

Conbee 1 + latest fw + .74 isn't playing 100%.
I had quite a few hits and misses. It seems to work better with .73 for me, but not 100%.

So back to the drawing board. It's still not 100% OK with the new (beta) Conbee 1 fw.

After a couple of days of operation with .74 and 26340500 on Conbee 1, with Tradfri E14 bulbs on firmware 1.2.221, I can report the following:

The IKEA lights became more stable in terms of dropping out of the network; I only had one bulb that got lost in 4 days. I also found out that you can still lose bulbs if you stress the hell out of a deCONZ-operated Zigbee network running on E14 fw 1.2.221: I ran a script fading down a bulb by sending a single request with a changed bri every second, and in that manner I lost 4 bulbs really quickly. But who wants to do that anyway ;)

Still unresolved:
The concern I still have is that Tradfri FW 2.3.x does not run well with deCONZ, or is not well supported by it. It's fine to stay on Tradfri fw 1.2.x and not go to Zigbee 3.0 for now, but there will be a point in time when it cannot be avoided, or when new bulbs can no longer be downgraded, I am afraid.
I discovered that a group of 2-3 bulbs doesn't show the issue as badly as a group of 4-8 bulbs.
I reported my findings and would be happy and thankful if they get picked up. I tried to raise awareness here: https://github.com/dresden-elektronik/deconz-rest-plugin/issues/2518
I can reproduce this issue by simply flashing all my bulbs to 2.3.x, and I am hoping you can do that too.

Bottom line: it makes a difference which Tradfri firmware is used and which one we are talking about. There is a huge difference in usability and in the issues experienced when operating with deCONZ. While the older 1.2.x firmwares work like a charm apart from the known dropouts, 2.3.x does not and has lost usability as described in the issue raised. I can't imagine I am the only one experiencing this difference between firmwares.

I have now sent an email to dresden elektronik support in German to raise awareness for this, as I understood that some of you are hobby enthusiasts, as @ebaauw made clear in another thread. I thought most of the devs were dresden elektronik contractors. So sorry for the misconception and thank you for your contributions. I am curious about the official answer now.

Is Conbee II firmware also fixed now? I saw only Conbee I in the release. Thanks all for your hard work by the way.

This is then again caused by deCONZ not picking up the route record as a new route.

@djwlindenaar turns out I'm to blame here. I've checked the Route Record code commits in the stack repository (ConBee I and RaspBee I). In 2018 I added a "fix" (or so I thought) for a similar problem, which just disabled updating the next hop address on an incoming route record.

If my memory serves me right, the problem at the time was that we had a large mixed network of around 150 nodes and route records seemingly didn't work correctly: the new path back wasn't working. However, the code might have worked with the other fix of the NWK status command error codes.

I revert this now, so the route record should update the route to the better path.

Love that paragraph from the Zigbee specification:

A ZigBee router or ZigBee coordinator may maintain a routing table. The information that shall be stored in a ZigBee routing table entry is shown in Table 3-66. The aging and retirement of routing table entries in order to re-claim table space from entries that are no longer in use is a recommended practice; it is, however, out of scope of this specification.

Which basically means every stack handles route aging differently.

So I've checked how it works in our case. In Bitcloud Zigbee stack routes have a route "rate", which is initially 1.

  1. On a successful NWK transmission the rate gets incremented if its below the maximum which is
    (1 << 8) - 1 = 255
  2. On a successful NWK transmission, if the maximum of 255 is reached, all routing table entries get "normalized"
    rate = (rate >> 1) + 1 (effectively divided by 2, with a minimum of 1)
  3. On a failed NWK transmission the rate of the related route entry is set as:
    rate -= (rate / 2) > 0 ? rate / 2 : 1
  4. On too many failed transmissions the rate becomes 0 and the route gets removed

This means a top link will degrade as:

255  top rate
127  first failed transmission
63   ...
31
15
7
3
1
0    the record gets removed

Therefore it takes 8 failed commands until an entry gets removed and discovery will be started again.
That's quite ok in normal networks especially when they grow to 50 .. 100 nodes, a route should not be discarded too early.
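The degradation can be replayed with a few lines of Python. This is only a sketch of the rules as written in this post, not the BitCloud source, and the function names are made up. One detail worth noting: rule 3 read literally with integer division gives 255, 128, 64, ... (nine failures until removal), while plain halving (rate >> 1) gives exactly the list above (eight failures). Which rounding the firmware really uses is not stated here.

```python
MAX_RATE = (1 << 8) - 1  # 255, as stated in the post


def fail_literal(rate: int) -> int:
    """Rule 3 read literally with integer division: subtract rate/2, or 1 if that is 0."""
    half = rate // 2
    return rate - (half if half > 0 else 1)


def fail_halving(rate: int) -> int:
    """Plain halving (rate >> 1), which reproduces the sequence listed in the post."""
    return rate >> 1


def degrade(rate: int, step) -> list[int]:
    """Apply `step` until the rate reaches 0, returning every intermediate value."""
    seq = [rate]
    while rate > 0:
        rate = step(rate)
        seq.append(rate)
    return seq


print(degrade(MAX_RATE, fail_halving))  # [255, 127, 63, 31, 15, 7, 3, 1, 0]
print(degrade(MAX_RATE, fail_literal))  # [255, 128, 64, 32, 16, 8, 4, 2, 1, 0]
```

Either way, a top-rated route survives only a handful of consecutive failures before it is dropped and discovery starts again.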

What concerns me is point (2), because a very fresh route, or one which isn't used that often, would be degraded by completely unrelated, high-performing nodes with good links. For example, a Philips Hue light which is polled every few seconds will trigger (2) quite often; multiply that by the number of lights in a large network. Not to mention active OTA updates.

I think it's safe to change (2) to not degrade (normalize) all route entries but only the top route at 255 that is related to the successful transmission. This should prevent losing routes which did work fine but weren't used often and were removed on the first failed NWK transmission.
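To make the effect of point (2) concrete, here is a toy simulation of a routing table with one heavily polled route and one rarely used route. It assumes the rules exactly as described above; the table layout, the route names and the starting rate of 40 are hypothetical.

```python
MAX_RATE = 255


def normalize(rate: int) -> int:
    """Rule 2: rate = (rate >> 1) + 1."""
    return (rate >> 1) + 1


def on_success(table: dict[str, int], dest: str, normalize_all: bool) -> None:
    """Record one successful NWK transmission to `dest`."""
    if table[dest] < MAX_RATE:
        table[dest] += 1
    elif normalize_all:
        # Behaviour as originally described: every entry is normalized.
        for key in table:
            table[key] = normalize(table[key])
    else:
        # Proposed change: only the route that hit the maximum is normalized.
        table[dest] = normalize(table[dest])


def simulate(normalize_all: bool, polls: int = 2000) -> dict[str, int]:
    # "hue" is polled constantly; "garage" is a working but rarely used route.
    table = {"hue": 1, "garage": 40}
    for _ in range(polls):
        on_success(table, "hue", normalize_all)
    return table


print(simulate(normalize_all=True))
print(simulate(normalize_all=False))
```

In this model the unused "garage" route sinks from 40 to 2 purely because of traffic to "hue", while with the proposed change it keeps its rate of 40.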

I'll build a new firmware with these changes tomorrow, and also one for ConBee II, since the same likely applies there.

I revert this now, so the route record should update the route to the better path.

OK, sounds good. I'm looking forward to test!

Love that paragraph from the Zigbee specification:

Yeah I guess that this kind of specifying is giving us a hard time getting all vendors to coexist nicely. :smile:

In Bitcloud Zigbee stack routes have a route "rate", which is initially 1.

I don't really get the logic behind rule 2. It seems a kind of poor-man's version of ageing. Which works quite OK if all nodes see a similar amount of traffic, but indeed, if it's unbalanced (which I think is quite common), it will go wrong.
I noticed the ZStack is using an actual expiryTime field in their routing table, next to a status byte.

3. On a failed NWK transmission the rate of the related route entry is set as:

How does this one actually work?
If I'm not mistaken, a failed NWK transmission actually means the next hop only, since NWK only checks the 802.15.4 MAC ACK. So to check if the end-to-end transmission is OK, we must rely on APS ack. Is that correct? Does it work this way in BitCloud?
Also: if the ack request bit in the APS layer is not set, is that considered a successful transmission (since no ack is to be expected) and is that incrementing the "rate" of that route entry? Because if so, we might be shooting ourselves in the foot by not requesting APS layer ACK all the time.

If it's only based on NWK failures, then this will not help in the situation where an intermediate router is misbehaving, and we need to add additional logic to take into account APS layer ACKs not arriving. Probably based on similar logic to what is in place to detect 'zombie' routers, but by first invalidating the route table entry for that route.

Firmware version 0x26350500 for ConBee I and RaspBee I is available for testing.

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_Rpi_0x26350500.bin.GCF

  • Like 0x26340500 all NWK status route errors are handled
  • Fix unfair route degradation (see point 2. in https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-596142584)
  • Fix route records not updating the next hop of a route; note this now works, but only if the current route cost is higher than that of the route record
  • Fix failed APS requests with enabled APS ACK didn't degrade the route

I'm now looking in the ConBee II code to see if and where these fixes can be applied. Dismissing 0x26520700 and 0x26530700 and basing this on current stable 0x264a0700.

I don't really get the logic behind rule 2. It seems a kind of poor-man's version of ageing. Which works quite OK if all nodes see a similar amount of traffic, but indeed, if it's unbalanced (which I think is quite common), it will go wrong.
I noticed the ZStack is using an actual expiryTime field in their routing table, next to a status byte.

Totally agree, this is now fixed so that only the route of a successful transmission gets normalized when its rate hits the maximum. The expiry timer way of handling this might be another option but has its own pitfalls at scale. I think the "rate" approach should work pretty well without depending on network size and time.

If it's only based on NWK failures, then this will not help in the situation where an intermediate router is misbehaving, and we need to add additional logic to take into account APS layer ACKs not arriving. Probably based on similar logic to what is in place to detect 'zombie' routers, but by first invalidating the route table entry for that route.

This seemed to be the case, and indeed misbehaving routers, which don't send NWK status commands with route failures, could keep a dead route alive. This is now fixed in 0x26350500, but it relies on APS ACK enabled commands, which should be fine and can be controlled by deCONZ and the REST-API plugin.

Firmware version 0x26350500 for ConBee I and RaspBee I is available for testing.

Flashed it, now testing.

This is now fixed in 0x26350500 but it relies on APS ACK enabled commands.

What do you do in response to a missing ACK? Do you halve the rate value, the same as with a failed NWK transmission, or something more/different? Because I think a missing APS ACK should be counted as more severe than a failed NWK transmission. Otherwise, again in the case of a misbehaving router, the successful NWK transmissions may increase the rate more than the missing APS ACKs lower it.

I thought the same, here is what's happening in that case:

  • NWK transmission with positive MAC ACK: rate + 1
  • No APS ACK: rate = rate / 4

So the rate drops fairly quickly. It might be a bit too aggressive and rate / 2 might be enough, but let's see how it works in practice.
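A quick way to get a feel for these two numbers is a toy model of the misbehaving-router case. It assumes that every command first earns a MAC ACK from the next hop (rate + 1, capped at 255) and then misses its APS ACK (integer division by the divisor), and that a rate of 0 removes the route. The ordering, the rounding and the function name are assumptions and may not match the firmware.

```python
MAX_RATE = 255


def commands_until_removed(divisor: int, start: int = MAX_RATE, limit: int = 100):
    """Count commands until the rate reaches 0, or return None if it never does.

    Assumed model: each command gets a MAC ACK from the next hop (rate + 1,
    capped at MAX_RATE) and then no APS ACK (rate // divisor).
    """
    rate = start
    for n in range(1, limit + 1):
        rate = min(rate + 1, MAX_RATE)  # hop-by-hop success
        rate //= divisor                # end-to-end failure
        if rate == 0:
            return n
    return None


print(commands_until_removed(4))  # 5
print(commands_until_removed(2))  # None
```

Under these assumptions a divisor of 4 removes a top-rated route after five unanswered commands, whereas a divisor of 2 gets stuck bouncing between 1 and 2 and never removes it.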

Nice. I'll run it and report back.

Currently also stress testing by sending unicast messages (OnOffwithLevel) to a light that's quite far away in the mesh, every few seconds.

Btw I also changed the rest plugin code to request APS ACK in all messages.

I'm running deCONZ on an RPi 3B+ with Home Assistant.
Currently running 2.05.75 with Conbee I and 26330500.

I noticed this evening that some lights were not turned on.
I tried to turn them on manually; see the log below.
I checked the mesh via VNC: the node is part of the mesh network but not reacting to anything.
This node is a Tradfri on/off outlet switch.

Strange thing here:
_I CAN turn the switch on when using the deCONZ group instead of the individual switch._

19:23:58:979 delay sending request 57 dt 0 ms to 0x000D6FFFFEB1C9FF, ep: 0x01 cluster: 0x0004 onAir: 1
19:24:15:744 Current channel 25
19:24:15:776 Device TTL 3920 s flags: 0x7
19:24:46:764 0x000D6FFFFEB1C9FF error APSDE-DATA.confirm: 0xA7 on task
19:25:10:547 0x000D6FFFFEB1C9FF error APSDE-DATA.confirm: 0xA7 on task
19:25:15:749 Current channel 25
19:25:15:782 Device TTL 3860 s flags: 0x7
19:25:33:885 0x000D6FFFFEB1C9FF error APSDE-DATA.confirm: 0xA7 on task
19:25:49:411 0x000D6FFFFEB1C9FF error APSDE-DATA.confirm: 0xA7 on task
19:26:12:765 0x000D6FFFFEB1C9FF error APSDE-DATA.confirm: 0xA7 on task
19:26:12:765 max transmit errors for node 0x000D6FFFFEB1C9FF, last seen by neighbors 25 s
19:26:15:742 Current channel 25
19:26:15:774 Device TTL 3800 s flags: 0x7
19:26:48:221 0x000D6FFFFEB1C9FF error APSDE-DATA.confirm: 0xA7 on task
19:26:48:221 max transmit errors for node 0x000D6FFFFEB1C9FF, last seen by neighbors 60 s
19:27:03:845 saved node state in 9 ms
19:27:33:634 sync() in 29789 ms

Latest fixes are in 26350500...

@djwlindenaar I'm holding my breath awaiting your test results :-)

So far so good. I can see deCONZ happily switching routes. :smile:

Latest fixes are in 26350500...

Sorry, misread the FW version.
I want to help testing, using the official Home Assistant Deconz add-on, not sure how to update this one to a beta firmware. (official updates are OTA)

Dear @manup, any status on the ConBee II fw update?

@manup , looks like there is no action to the Many-to-One Route Failure (0x0c). I'd expect deCONZ to do again a MTORR (unless one is already in progress). I do get these messages regularly.

Command Frame: Network Status
    Command Identifier: Network Status (0x03)
    Status Code: Many-to-One Route Failure (0x0c)
    Destination: 0x499e[Tuin rechtsachter 2]

@manup , it seems that the route record is still not always taken over by deCONZ. For example:

No.                Time               Source             Transmit Dev       Receive Dev        Destination        Disable Default Response              Acknowledgement Request               Info
700766             13h 27m 35.628249s Tuin linksvoor 2   Tuin linksvoor 2   DeConz             DeConz                                                   Route Record, Dst: 0x0000
700784             13h 27m 39.591343s DeConz             DeConz             WC lamp            Tuin linksvoor 2   True               True               ZCL Level Control: Move to Level with OnOff, Seq: 162
700786             13h 27m 39.597002s DeConz             WC lamp            Tuin linksvoor 1   Tuin linksvoor 2   True               True               ZCL Level Control: Move to Level with OnOff, Seq: 162
700788             13h 27m 39.600699s DeConz             Tuin linksvoor 1   Tuin linksvoor 2   Tuin linksvoor 2   True               True               ZCL Level Control: Move to Level with OnOff, Seq: 162

The light Tuin linksvoor 2 is sending directly to deCONZ, but the path from deCONZ goes through several hops...
It would be helpful if you could give some insight into how deCONZ decides whether or not to accept the route from a route record message.

--- edited the below items, because there was a flaw in my reasoning. Now hopefully not :wink: ---

Having said that, the route that is used by deCONZ is actually a lot more reliable than the direct communication to Tuin linksvoor 2. This got me wondering and looking into the many-to-one routing. I'm thinking that maybe the transmit power of the Raspbee is not ideal...
This is my reasoning:

  • Many-to-one routing is based on the receipt of broadcasts
  • Unlike normal route handling, the maximum of incoming and outgoing path costs is taken as the cost for the next hop
  • Now if a router or the coordinator is much better at transmitting (high TX power) than it is at receiving (low RX sensitivity), this may lead to many-to-one routing not working very well.

A related point I noticed is that deCONZ seems to be (very) optimistic about the incoming cost in its route table. When Tuin linksvoor 2 sends a message to deCONZ directly, it needs a lot of retries, all the time. Now if I look in the route table I see the following, and I would not expect difficult transmission at an incoming cost of 4. Note that basically all items in the route table have a much lower incoming cost than outgoing cost.

Link 4
    Address: 0x0ea5[Tuin linksvoor 2]
    .... .100 = Incoming Cost: 4
    .111 .... = Outgoing Cost: 7

So if deCONZ were less optimistic about the incoming cost, we may be seeing better route behavior. Or if we would increase the TX power to be more balanced (similar cost for in- and outgoing)
Or if we would lower the TX power of deCONZ, I expect it will need to route though more hops (the cost 7 devices would drop out of the route table), but also it will be easier for incoming messages to be received.
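For reference, the two cost fields can be pulled out of a link status entry as sketched below. The bit positions are taken from the dissection above (bits 0-2 incoming, bits 4-6 outgoing); the function names are made up, and the max() rule simply mirrors the reasoning in this post rather than any confirmed stack behaviour.

```python
def decode_link_status_entry(octet: int) -> tuple[int, int]:
    """Return (incoming_cost, outgoing_cost) from a link status octet.

    Bit layout as shown in the dissection: bits 0-2 incoming, bits 4-6 outgoing.
    """
    incoming = octet & 0x07
    outgoing = (octet >> 4) & 0x07
    return incoming, outgoing


def many_to_one_hop_cost(octet: int) -> int:
    """Hop cost as the maximum of both directions (per the reasoning in this post)."""
    return max(decode_link_status_entry(octet))


# Tuin linksvoor 2 from the capture: ".... .100" and ".111 ....", i.e. 0x74.
print(decode_link_status_entry(0x74))  # (4, 7)
print(many_to_one_hop_cost(0x74))      # 7
```

The gap between the two values (4 in, 7 out) is the asymmetry discussed above.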

Total list from deCONZ link status message:

Command Frame: Link Status
    Command Identifier: Link Status (0x08)
    .1.. .... = Last Frame: True
    ..1. .... = First Frame: True
    ...1 0010 = Link Status Count: 18
    Link 1
        Address: 0x0118[Buiten - R schuur]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 2
        Address: 0x0143[HK stalamp]
        .... .010 = Incoming Cost: 2
        .101 .... = Outgoing Cost: 5
    Link 3
        Address: 0x05b5[HK houtlamp 2]
        .... .010 = Incoming Cost: 2
        .101 .... = Outgoing Cost: 5
    Link 4
        Address: 0x0ea5[Tuin linksvoor 2]
        .... .100 = Incoming Cost: 4
        .111 .... = Outgoing Cost: 7
    Link 5
        Address: 0x1ad3[HK plafond ledstrip]
        .... .010 = Incoming Cost: 2
        .001 .... = Outgoing Cost: 1
    Link 6
        Address: 0x1ff6[Tuin linksvoor 1]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 7
        Address: 0x4b4d[Keuken mid]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 8
        Address: 0x5693[Keuken Rechts]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 9
        Address: 0x68c4[WC lamp]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 10
        Address: 0x6c35[Buiten - L schuur]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 11
        Address: 0x731e[Kerstverlichting]
        .... .011 = Incoming Cost: 3
        .011 .... = Outgoing Cost: 3
    Link 12
        Address: 0x7d2a[0x7d2a]
        .... .011 = Incoming Cost: 3
        .101 .... = Outgoing Cost: 5
    Link 13
        Address: 0xa3f5[HK zoutlamp]
        .... .010 = Incoming Cost: 2
        .001 .... = Outgoing Cost: 1
    Link 14
        Address: 0xc7bc[Hal lamp]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 15
        Address: 0xc9fa[Tuin linksvoor 3]
        .... .011 = Incoming Cost: 3
        .111 .... = Outgoing Cost: 7
    Link 16
        Address: 0xd6b7[Zolder Noord Lamp]
        .... .010 = Incoming Cost: 2
        .101 .... = Outgoing Cost: 5
    Link 17
        Address: 0xde81[On/Off light 36]
        .... .010 = Incoming Cost: 2
        .001 .... = Outgoing Cost: 1
    Link 18
        Address: 0xefd5[Zoldertrap Lamp]
        .... .010 = Incoming Cost: 2
        .001 .... = Outgoing Cost: 1

Sorry for the delay. Status: We're still working hard on the new release, and still tracking down bugs in the firmware which seem to be more complex than we thought. Something in the NVRAM handling seems to be off. We're also looking into getting the USB enumeration/startup issues tracked down.

I'll post updates here as soon as they are available.

Update: still going strong!

I don't think I've ever seen the network this stable. I've by now changed various rules to use unicast instead of groups. Now for 2 weeks and 0 IKEA lights went missing.

(Hope I don't jinx it now)

Note that I did change the REST plugin to request APS ack for all requests.

I’m experiencing a much improved network as well. Not 100% though. There have been occasional bulbs that did not respond, but when I manually altered the IKEA bulb it did respond. I also see that the reported status of the bulb now matches its physical status.
However, one of my Osram bulbs started showing the unresponsive behavior that the IKEA bulbs had before. The positive is that it is not as severe as it was with the IKEA bulbs.

Hope this can be confirmed by others or findings identified.

I am running Conbee 1, with FW 26350500 and Deconz 2.05.75.
This is my experience the last weeks

  • Works better but not 100%
  • Some of my IKEA E27 bulbs (TRÅDFRI bulb E27 WS opal 980lm with fw 2.3.007) sometimes fail to answer OFF commands
  • I can just try to turn them off again and it usually works (no need to power cycle)

@djwlindenaar nice! Is your APS fix for the REST API included in the .75 release? Just so I know if that could explain some differences in previous posts...

No it's not. I haven't even created a pull request for it. With that you can build it yourself. I'll do that today or tomorrow.
Also I can share the built rest API plugin, but I only have the one for raspberry 3.

@tubalainen , I'll check if that can be explained with the APS ack. Also I don't have any 2.3 IKEA lights.

Possibly, there's an issue with the retry behaviour in deCONZ. I'd need to do an experiment for that or maybe it's already in the sniff logs.

If you can sniff, getting a sniff of that phenomenon would help. I can help analysing if you can't.

Upgraded my production network (RPi 3B+, stretch, RaspBee) to 2.05.74 and 0x26340500 yesterday. Seems stable, except for the issues below.
Not sure if related, but the route to my lumi.curtain curtain controller was lost this morning. Reports by the controller would still reach the gateway. The route wouldn't be restored on power cycling the controller. I had to open the network and power cycle the controller, before the coordinator would reach it again.
Also, one of my Eurotronic Spirit radiator valves was unreachable by deCONZ after starting deCONZ, while still sending reports. Power cycling the TRV brought it back, as usual.
I didn't have the opportunity to do any deep investigation this time, but both devices have been problematic from the start, exhibiting these symptoms every once in a while. Same for my IKEA Fyrtur blind. I'll continue to monitor the situation and report back if running into more issues.

It's very hard to record these intermittent problems objectively, but I do have the impression that 0x26350500 has brought improvements here. Apart from the devices mentioned above, my network is very stable. I've had some TRVs becoming unreachable from the gateway, but mostly (only?) after restarting deCONZ. I don't think the FYRTUR nor the curtain controller has gone MIA for the last three weeks.

Note that I did change the REST plugin to request APS ack for all requests.

I do have APS Acks enabled in the _Network Settings_ in the GUI, but I'm not sure if this applies only to the messages sent by the GUI, or also to the REST API plugin.

If you want to run with the pull request above, you can also checkout my fork and build that: djwlindenaar / deconz-rest-plugin

@ebaauw , as far as I can tell, this only applies to the commands sent from the GUI

I've got a sniffer running continuously and I note the wall-clock time when an issue happens. Usually with that info I can find the packets quite easily...

BTW I've switched a lot of my rules (especially those that trigger often) to unicast. So far that's working great. Also I've been running one IKEA light continuously changing the brightness (every 4 seconds or so) for the last 2 weeks. That one is also still fine.

well... I just noticed something funky in my logs. I found that none of the lights which I power cycled would be reporting their attributes. I thought I was onto another bug, but I wasn't, although in a way I think we might want to consider this a bug anyway.

As I mentioned, I had a script running to change the level of a light every couple of seconds, thinking that this might accelerate issues with IKEA lights. Turns out this resets the d->idleLastActivity counter, which prevents any idle tasks from being run, including configuring the attribute reporting :rofl:
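The starvation effect is easy to reproduce in a toy model. The sketch below assumes the counter holds seconds since the last request and that idle tasks only run once it reaches some threshold; the class, the threshold of 30 seconds and the one-second tick are hypothetical and not taken from the plugin source.

```python
from typing import Optional

IDLE_THRESHOLD_S = 30  # hypothetical; the real value is not given in this thread


class IdleScheduler:
    """Toy model of an idle counter that every request resets."""

    def __init__(self) -> None:
        self.idle_last_activity = 0  # seconds since last request, cf. d->idleLastActivity
        self.idle_tasks_run = 0

    def tick(self) -> None:
        """Advance one second; run an idle task once the threshold is reached."""
        self.idle_last_activity += 1
        if self.idle_last_activity >= IDLE_THRESHOLD_S:
            self.idle_tasks_run += 1  # e.g. configure attribute reporting
            self.idle_last_activity = 0

    def on_request(self) -> None:
        """Any outgoing request counts as activity and resets the counter."""
        self.idle_last_activity = 0


def run(seconds: int, request_every: Optional[int]) -> int:
    """Return how many idle tasks ran during `seconds` of simulated time."""
    sched = IdleScheduler()
    for t in range(1, seconds + 1):
        if request_every and t % request_every == 0:
            sched.on_request()
        sched.tick()
    return sched.idle_tasks_run


print(run(300, request_every=None))  # 10
print(run(300, request_every=4))     # 0
```

With a request every 4 seconds the counter never gets anywhere near the threshold, so no idle task ever runs in this model.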

Are you saying that the IKEA lights lose their bindings and attribute reporting configuration after a power cycle?!

Looks like it... Shouldn't they..?

My problem now is that since I upgraded my setup to .75 and 50500 my docker container is losing the Conbee at least once a week. Restart of the container gets things going again...VERY annoying

@djwlindenaar, no, I don’t think they should. Most devices I’ve seen keep these settings in non-volatile memory. I suppose the ZigBee standard leaves room for either behaviour.

While my IKEA bulbs are running way better now, unfortunately, one of my Xiaomi sensors (round temperature sensor) goes unresponsive after a while. I'll try gather some evidence by sniffing in the next days.

I am running Conbee 1, with FW 26350500 and Deconz 2.05.75.
This is my experience the last weeks

  • Works better but not 100%
  • Some of my IKEA E27 bulbs (TRÅDFRI bulb E27 WS opal 980lm with fw 2.3.007) sometimes fail to answer OFF commands
  • I can just try to turn them off again and it usually works (no need to power cycle)

@tubalainen Hi, I noticed the different behavior of IKEA bulbs with 2.3.x firmware in comparison to 1.2.x too. I tried to address it but got no attention. I downgraded my bulbs to 1.2.x and it works like a charm now. On 2.3.x you can't switch off for a period of time after a scene recall, while normal on/off works. Strange behavior. Maybe you want to test and contribute here. Cheers https://github.com/dresden-elektronik/deconz-rest-plugin/issues/2518

Are you saying that the IKEA lights lose their bindings and attribute reporting configuration after a power cycle?!

Unfortunately they do lose their bindings after a power cycle; the code was adapted a while ago to address that. Bindings and configuration will be restored after a light is power cycled. Also, if no reports are received for a while, the process will kick in to restore them.

@realwax I will downgrade all my E27 bulbs to 1.2.x and report back (takes some time :) ).

Yesterday two E27 lights (with 2.3x FW) did not turn off during the morning sequence.

@tubalainen do you use group commands or unicast commands? (i.e. via the REST api, set state via /groups/ or /lights/)

I think that as long as lights are reachable, actually unicast commands are more reliable. If a group broadcast command is missed somehow, it is not retried. Unicast commands are retried a few times or until delivered.
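In terms of the REST API, the difference is only in which resource the PUT goes to. The sketch below just builds the two requests without sending them; the gateway address, the API key and the ids are placeholders.

```python
import json
import urllib.request

# Hypothetical gateway address and API key; replace with your own.
BASE = "http://192.168.1.10:80/api/ABCDEF1234"


def light_request(light_id: str, state: dict) -> urllib.request.Request:
    """Unicast: PUT /lights/<id>/state addresses one light."""
    return urllib.request.Request(
        f"{BASE}/lights/{light_id}/state",
        data=json.dumps(state).encode(),
        method="PUT",
    )


def group_request(group_id: str, action: dict) -> urllib.request.Request:
    """Group: PUT /groups/<id>/action addresses all members with one group command."""
    return urllib.request.Request(
        f"{BASE}/groups/{group_id}/action",
        data=json.dumps(action).encode(),
        method="PUT",
    )


req = light_request("3", {"on": False})
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

A client that sets state through /lights/ is therefore using unicast, and one that goes through /groups/ is using group commands.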

I use Home Assistant and the RESt api. Do not know what Home Assistant does ...

In my case it is iobroker with deconz plug in and Phoscon... So rest API. The issues appear when using igroups. A group scene recall triggers that the group can't be turned off, nor quickly changed to other group scene settings or being switched off properly, especially within up to 15 seconds after scene recall. It seems like deconz is busy with the command before, or 2.3.x fw bulbs freeze(which I doubt) . Can't debug on zigbee level yet to get a better understanding. Is the grouping feature of deconz a virtual layer that is translated in uni cast commands or is done via grouping options available in the gui/zigbee? Bottom line I use the built in group function otherwise I would need to build this virtual layer in iobroker and there is no reason as grouping feature is good. So if it is the grouping... What is the difference as it seems 100% with fw 1.2.x and not with 2.3.x. What changed? Is it zigbee 2 why they behave differently.

@tubalainen Yes, that is a lot of work, as you might need to restart from time to time or bring some bulbs closer. I did it twice with 20 E14 Tradfri. You should see a huge improvement. Do you notice that your bulbs with 2.3.x no longer turn on softly (fade in), while those with 1.2.x do?

BUT fingers crossed that more people can reproduce it, so that IKEA's fw 2.3.x becomes usable in deCONZ, as there will be a point in time when we need to upgrade. Or replace bulbs. Though Zigbee 2 would be nice as well.

Thank you all for your efforts!

@realwax @djwlindenaar @manup

Last night one light did not turn off as it should (IKEA E27 with fw 2.3.x). I tried changing the brightness on the light that did not turn off, and it changed instantly to the brightness I picked. Moments after changing the brightness, the light suddenly also responded to the off command.

I have personally now changed all my automations in Home Assistant to first change brightness, wait for 2 seconds then send the turn off command.

So far 100% success rate.

Hope this can be a clue to the investigation.

EDIT: It is still always the lights farthest away from the coordinator (the Conbee stick) that fail to turn off (lights that, due to the nature of the frequency band, have to mesh).

Hey folks!

Just wanted to pitch in my issues...
After having stability issues with my ConBee II stick I checked the firmware version. It was 26530700. I then downgraded to 264a0700, and after that no application is able to see the stick. I have tried HomeAssistant and deCONZ. The host OS identifies the stick OK and GCFFlasher works.

After downgrading to 26490700 everything seems to be working again... Stable Zigbee mesh for 24 hours now....

Any updates? I really want to switch my whole house to my Conbee II but at the moment it is very unstable. My hue works perfectly, Conbee ii not so much 🥺

My experience with the latest deCONZ .75 with RaspBee FW 0x26350500 is very good so far.

My devices:
4xTradfri 980lu WS lights - FW 2.3.007
17xTradfri 1000lu WS lights - FW 2.0.023
3xTradfri Plugs - FW 2.0.022
3xTradfri Round Remotes E1810 - FW 2.3.014
4xAqara THP sensors

Found another one that crashes the IKEA bulb firmware. (and I think it can be fixed in deConz)

I saw one light not responding this morning. The last communication is shown below.

It looks like deConz does receive the Group Membership Response, but somehow the APS ACK (acknowledging the Get Group Membership Request) sent by the light is not received by deConz (I also do not see a MAC ACK). As a result deConz resends the request. This request has the same number in the ZCL, which crashes the light firmware.

I guess deConz could consider the request acknowledged as soon as the corresponding Response arrives and avoid putting in another request. Right? Is there an API for the plugin which can be called to cancel a request? @manup?

Note that this specific bug is also worked around by not requesting the APS ACK for these requests, which is the default in the current REST API plugin.

No.               Time              Source            Transmit Dev      Receive Dev       Destination       Disable Default Response            Acknowledgement Request             Info
74174             2h 48m 12.832154s DeConz            DeConz            HK houtlamp 2     HK houtlamp 2     True              True              ZCL Groups: Get Group Membership, Seq: 32
74176             2h 48m 12.841977s HK houtlamp 2     HK houtlamp 2     DeConz            DeConz                                                Route Record, Dst: 0x0000
74178             2h 48m 12.847098s HK houtlamp 2     HK houtlamp 2     DeConz            DeConz                                                Route Record, Dst: 0x0000
74180             2h 48m 12.890302s HK houtlamp 2     HK houtlamp 2     DeConz            DeConz            True              True              ZCL Groups: Get Group Membership Response, Seq: 32
74182             2h 48m 12.896074s DeConz            DeConz            HK houtlamp 2     HK houtlamp 2                       False             APS: Ack, Dst Endpt: 1, Src Endpt: 1
74183             2h 48m 12.899402s DeConz            DeConz            HK houtlamp 2     HK houtlamp 2                       False             APS: Ack, Dst Endpt: 1, Src Endpt: 1
74184             2h 48m 12.902460s HK houtlamp 2     HK houtlamp 2     DeConz            DeConz                              False             APS: Ack, Dst Endpt: 1, Src Endpt: 1
74185             2h 48m 12.904330s DeConz            DeConz            HK houtlamp 2     HK houtlamp 2                       False             APS: Ack, Dst Endpt: 1, Src Endpt: 1
74190             2h 48m 14.186599s DeConz            DeConz            HK houtlamp 2     HK houtlamp 2     True              True              ZCL Groups: Get Group Membership, Seq: 32
76346             2h 52m 41.998416s DeConz            DeConz            HK houtlamp 2     HK houtlamp 2                       True              Link Quality Request
76354             2h 52m 43.668838s DeConz            DeConz            HK houtlamp 2     HK houtlamp 2                       True              Link Quality Request
202171            7h 39m 10.905361s HK houtlamp 2     HK houtlamp 2     Broadcast         Broadcast                           False             Device Announcement, Nwk Addr: 0x05b5, Ext Addr: SiliconL_ff:fe:c5:2c:7d
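
The suggestion above (treat a request as acknowledged once its response arrives) could be sketched roughly as follows. This is not the deCONZ plugin's code: the class, its method names and the retry limit are all hypothetical, and for simplicity it keys everything on the ZCL sequence number, whereas a real stack matches APS ACKs by APS counter.

```python
class PendingRequests:
    """Track outstanding requests by (network address, ZCL sequence number)."""

    def __init__(self, max_retries=2):
        self.max_retries = max_retries
        self._pending = {}  # (addr, seq) -> number of resends so far

    def sent(self, addr, seq):
        """Record that a request was sent and is awaiting acknowledgement."""
        self._pending[(addr, seq)] = 0

    def on_aps_ack(self, addr, seq):
        """The normal path: the APS ACK arrived, nothing left to do."""
        self._pending.pop((addr, seq), None)

    def on_response(self, addr, seq):
        """A matching response proves the request was delivered, even if
        the APS ACK was lost, so the request is cancelled here as well."""
        self._pending.pop((addr, seq), None)

    def should_resend(self, addr, seq):
        """Call on ACK timeout; returns True if the request should go out again."""
        key = (addr, seq)
        if key not in self._pending:
            return False  # already acknowledged or answered
        if self._pending[key] >= self.max_retries:
            del self._pending[key]
            return False  # give up
        self._pending[key] += 1
        return True
```

In the trace above, the response with Seq 32 from 0x05b5 would call `on_response(0x05b5, 32)`, so the timeout that led to the second Get Group Membership request would find nothing left to resend.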

Running v2.05.75 with 0x26350500 for a couple of weeks now. It seems a bit more stable than previous versions, but I'm still occasionally losing the route to my Eurotronic Spirit TRVs, my Fyrtur blind, and my Xiaomi lumi.curtain curtain controller. The latter is a router; the others are end devices. All TRVs have the same hardware/firmware version, but some go MIA more often than others. The symptoms are consistent: the device continues to send reports to the coordinator, but commands from the coordinator only result in an unanswered _Route Request_.

Currently sniffing and analysing the traffic for the TRV that goes missing most often. Reports reach the coordinator in three hops, using two Hue lights along the way. I also captured a _Data Request_ from the TRV to the first light en route, so the TRV seems happy that this is its parent. Match descriptor requests from the TRV for the OTAU cluster go unanswered. The parent reports the next light in its _Link Status_, but not the TRV (because it's an end device?).

The _Link Quality Response_ messages show a neighbour table of 20 entries, but the TRV is not amongst them. A Xiaomi door sensor (that has been stable forever) is. Oddly so is the coordinator, yet the report from the TRV to the coordinator was relayed through another router (which is also in the neighbour table).
OK, now the coordinator is also included in the _Link Status_ and the next report from the TRV is forwarded directly to the coordinator.

Power cycled the TRV. The TRV sends a MAC _Data Request_ to the (former) parent; the router responds with a _Rejoin Response_ passing the TRV's old NWK address as new address. The TRV then broadcasts the _Device Announcement_ (MAC unicast to the parent; the parent forwards as a MAC broadcast). The TRV sends an _End Device Timeout Request_ to the parent; the parent sends an _Update Device_ to the coordinator informing it that the TRV has rejoined securely. The parent now also sends a _Route Reply_ to the coordinator's _Route Request_. In the next _Link Quality Response_ sequence, the TRV is included.

I'll keep the sniffer running, hopefully to catch the moment where the TRV goes missing again.

On a possibly related note: one of my innr SP120 smart plugs still thinks it's the parent for the Hue button, which I briefly joined to my production network while adding support. The button has since been joined to my test network for a couple of weeks now, and I've power cycled the plug several times. Do I need to factory reset the plug to make it forget the lost child?

@manup, it has been a long time since any update, both in code and in info on what is going on. Could you please give us an update on your team's progress on the stability issues and, if you dare, an expected timeline?

I've found another issue in the routing behavior of deConz.

In this case, deConz tries to route the message to Hal lamp through Tuin linksvoor 3. But looking at the Link Status report from Tuin linksvoor 3 it does not know about Hal lamp. And apparently it also does not know how to reach it through routing. Of course that light (IKEA) should behave itself and respond with a failure message, but it doesn't and we can't change that.
However, deConz concludes that the Hal lamp is a zombie without any attempt to find a new route to that light. Not sure how that interacts with the (new) routing code, but somehow it didn't degrade that route fast enough to prevent it from being flagged as zombie. (BTW it really isn't, see next ...)

This caused a temporary issue which resolved itself after a few minutes (which is of course completely unacceptable). Because the Hal lamp decides to send an attribute report, for which it doesn't receive an APS ACK and therefore starts a Route Request process. Only now, after this is completed, deConz changes its route to Hal lamp and communication resumes as normal.

I wonder how long it would have taken if the light didn't decide to send a message to deConz. (Note that in my network I'm running the IKEA lights with regular attribute reporting for On/Off and Level clusters every 5 minutes)

No.                Time               Source             Transmit Dev       Receive Dev        Destination        Disable Default R  Acknowledgement Request               Info
392213             13h 31m 26.050526s Tuin linksvoor 3   Tuin linksvoor 3   Broadcast          Broadcast                                                Link Status
392241             13h 31m 26.182875s DeConz             DeConz             Tuin linksvoor 3   Hal lamp           True               True               ZCL Level Control: Move to Level with OnOff, Seq: 252
Command Frame: Link Status
    Command Identifier: Link Status (0x08)
    .1.. .... = Last Frame: True
    ..1. .... = First Frame: True
    ...1 0000 = Link Status Count: 16
    Link 1
        Address: 0x0000[DeConz]
        .... .111 = Incoming Cost: 7
        .100 .... = Outgoing Cost: 4
    Link 2
        Address: 0x0118[Buiten - R schuur]
        .... .111 = Incoming Cost: 7
        .111 .... = Outgoing Cost: 7
    Link 3
        Address: 0x0143[HK stalamp]
        .... .111 = Incoming Cost: 7
        .111 .... = Outgoing Cost: 7
    Link 4
        Address: 0x05b5[HK houtlamp 2]
        .... .111 = Incoming Cost: 7
        .101 .... = Outgoing Cost: 5
    Link 5
        Address: 0x0ea5[Tuin linksvoor 2]
        .... .011 = Incoming Cost: 3
        .001 .... = Outgoing Cost: 1
    Link 6
        Address: 0x1ff6[Tuin linksvoor 1]
        .... .011 = Incoming Cost: 3
        .011 .... = Outgoing Cost: 3
    Link 7
        Address: 0x23ec[Tuin linksachter 1]
        .... .111 = Incoming Cost: 7
        .111 .... = Outgoing Cost: 7
    Link 8
        Address: 0x2b9e[Bijkeuken]
        .... .111 = Incoming Cost: 7
        .111 .... = Outgoing Cost: 7
    Link 9
        Address: 0x325d[0x325d]
        .... .111 = Incoming Cost: 7
        .111 .... = Outgoing Cost: 7
    Link 10
        Address: 0x6339[Tuin rechtsvoor 3]
        .... .101 = Incoming Cost: 5
        .101 .... = Outgoing Cost: 5
    Link 11
        Address: 0x68c4[WC lamp]
        .... .111 = Incoming Cost: 7
        .111 .... = Outgoing Cost: 7
    Link 12
        Address: 0x731e[Kerstverlichting]
        .... .111 = Incoming Cost: 7
        .000 .... = Outgoing Cost: 0
    Link 13
        Address: 0x7d2a[HK houtlamp]
        .... .111 = Incoming Cost: 7
        .111 .... = Outgoing Cost: 7
    Link 14
        Address: 0xc520[Badkamer ledstrip]
        .... .111 = Incoming Cost: 7
        .101 .... = Outgoing Cost: 5
    Link 15
        Address: 0xca27[Tuin rechtsvoor 2]
        .... .101 = Incoming Cost: 5
        .101 .... = Outgoing Cost: 5
    Link 16
        Address: 0xd6b7[Zolder Noord Lamp]
        .... .111 = Incoming Cost: 7
        .000 .... = Outgoing Cost: 0
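
For anyone who wants to run the same check on their own captures, here is a small sketch that decodes the fields shown in the dump. The bit positions are read directly off the Wireshark dissection above; the function names and the dictionary layout are my own and purely illustrative.

```python
def parse_link_status_options(options):
    """Split the options byte of a Link Status command."""
    return {
        "count": options & 0x1F,              # bits 0-4: number of link entries
        "first_frame": bool(options & 0x20),  # bit 5
        "last_frame": bool(options & 0x40),   # bit 6
    }


def parse_link_entry(addr, status):
    """Decode one link entry: a 16-bit neighbour address plus one status byte."""
    return {
        "addr": addr,
        "incoming_cost": status & 0x07,         # bits 0-2
        "outgoing_cost": (status >> 4) & 0x07,  # bits 4-6
    }


def lists_neighbor(entries, addr):
    """True if addr appears among the decoded link entries."""
    return any(entry["addr"] == addr for entry in entries)
```

With the values from the dump, the options byte `0x70` decodes to 16 entries in a single frame, and the first entry (status byte `0x47`) gives incoming cost 7 and outgoing cost 4 for the link to the coordinator. Running `lists_neighbor` with the address of Hal lamp over all 16 entries returns False, which is the point of this post.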

Your description sounds like what I experience. I have a much better network (ConBee 1) with the latest fw, but I still get unresponsive bulbs that become responsive again after a while.
I do not run any commands or settings beyond what Home Assistant provides and my usual schedule for the bulbs. My interior bulbs change during the day (on/off or brightness), whereas my exterior bulbs only change three times a day. Sometimes the unresponsive bulbs come back quite quickly and respond if I issue a command. Rarely, but it happens, it takes much longer or I have to power cycle.

@djwlindenaar great work again. Thank you very much! Are you sharing your bug finding(s) in IKEA FW with IKEA?

@djwlindenaar:

Of course that light (IKEA) should behave itself and respond with a failure message, but it doesn't and we can't change that.

Well, I have not. Maybe @manup can comment on the success rate of doing this, since I believe he has tried to contact IKEA.
Also I'm not using their latest firmware.

Maybe contacting silicon labs would be a better idea, since that's what the IKEA stuff is built on. I'm not sure if the bugs stem from the Ember code or the IKEA customisation...

@djwlindenaar You could also try to reach out via reddit: https://www.reddit.com/user/TRADFRI They're pretty active there.

@manup Any news for the Conbee II firmware?

After downgrading the firmware back to 0x264a0700 I can no longer connect to the Conbee II. I tried downgrading to 0x264a0700 and some really old firmwares as well; flashing works fine but I cannot connect. Any advice on how to reset the Conbee II stick?

@manup any updates? Should I look for something else than deConz or is work in progress to solve the issues? Please give a sign of life 🤗

I just wish to get my Conbee II working again after trying the test firmware...

@djwlindenaar Any updates from your side? Still a stable network with your fixes?

Good to see that your PR is merged by @manup :)

@JBS5 I'm medium happy with the situation.

It has clearly improved for my network. However, I still see the bugs in the IKEA lights confusing deCONZ sometimes.

The key point is that sometimes an IKEA router silently drops packets for a certain route. This silent dropping is illegal, but deCONZ should react to it by finding a new route and it doesn't.

It looks like the changes to the deCONZ firmware do improve the situation, but there is still something to add for these situations. For sure, the absence of an APS ACK should immediately trigger a new route finding.

Note that @manup did mention that source routing could solve the issue and I believe this is especially true in this case, since it means we do not depend on routers in the network knowing how to route to a remote node.

I believe the bug in the IKEA lights is a result of some table not being able to hold all the routes to remote nodes. Thus silently dropping any packet for which it doesn't know the route.

Thanks @djwlindenaar.
As it has been a while since @manup commented here, do you have any clue whether the mentioned source routing is something that will be implemented?

This is a change that needs to happen in the firmware. I'm not affiliated to deCONZ, so I can't comment on the likelihood...

@manup ?

It _is_ causing me to consider moving away from deCONZ for my home network...

@djwlindenaar, what alternatives are you considering? I am currently not impressed with the stability of my Conbee II.

Does Zigbee2MQTT do this better or didn't you mean that with the alternative?

Yeah. That's it.

I actually don't know if that does a better job. Note that the firmwares used there are provided by the chip makers. Those firmwares are not open source. So if they have a firmware issue, you're probably stuck with less support than with deCONZ...
On the other hand the configuration of these firmwares is open source. Also they support source routing already.

The chip maker provides only the SDK; if you are so inclined, you could download the SDK and a trial version of Simplicity Studio and compile the firmware yourself. Z2M does provide patches with the changes they made to the firmware.

Hello together,

Apologies that it took so long, here is the new firmware for ConBee II which has ported all the routing fixes from AVR firmware 0x26350500. Further it has improved startup to prevent the device going silent. (We're still testing out various cases to fix the boot issue for good).

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_ConBeeII_0x26580700.bin.GCF

This version will likely be part of the upcoming deCONZ 2.05.76 release; the ConBee I and RaspBee I firmware version 0x26350500 will then be offered for installation via the update button.

Thanks, @manup. Installed this version on my test network and it seems to work. At least devices can be controlled, unlike under 52 and 53. The test network is too small to check whether this version improves the routing.

@manup deCONZ still won't connect to ConBee II, even after flashing the new firmware. Had this problem for a while.

@manup I tried to flash it a few times, but keep getting this error:

Flash update failed, invalid CRC. Please try again.
14:29:06:105 query bootloader v1 ID after 5 ms
14:29:06:122 RX 60 bytes ASCII
R21B18 Bootloader
Vers: 2.05
build: Mar 22 2019
, 14:55:05
 after 22 ms
14:29:06:122 bootloader start after 22 ms
R21B18 Bootloader
Vers: 2.05
build: Mar 22 2019
14:29:06:124 GCF_ResetDeviceDone
14:29:06:125 bootloader v1 update firmware
flashing 160930 bytes: |==============================|
verify: .
Flash update failed, invalid CRC. Please try again.

Is this GCFFlasher 3.13?

Here is the log from my update:

GCFFlasher_internal -d /dev/ttyACM0 -f deCONZ_ConBeeII_0x26580700.bin.GCF 
GCFFlasher V3_13 (c) dresden elektronik ingenieurtechnik gmbh
Reboot device /dev/ttyACM0 (ConBee II)
deCONZ firmware version 26570700
R21B18 Bootloader
Vers: 2.05
build: Mar 11 2019
flashing 160930 bytes: |==============================|
verify: .
SUCCESS
Wait 10 seconds until application starts

Yes it was 3.13, it does work now I've rebooted the full system. Weird, just replugging the ConBee II didn't work, but rebooting does..

Indeed strange, the CRC check is done directly on the device, I'm wondering how this can happen.

@manup I even tried flashing old firmware (multiple versions) and 9600 baudrate. All resulted in the same CRC error.

Happy that it works now, thanks! I already saw one problematic device that joined the network immediately. So I'm hopeful this firmware will fix some issues :)

Glad that it works now. I just had a discussion with my colleagues about the bootloader. Version 2.05 was from the first batch of ConBee II in 2019 (about 3500 pcs); this version is a bit rough in some cases. Since July 2019 we ship 2.07 with a few fixes to the watchdog handling.

There is a new bootloader 3.x in development which is already part of RaspBee II, it has a way more robust design and protocol to fix the issues of 2.x version.

Currently we don't update the ConBee II bootloader with GCFFlasher. We haven't rolled that out since there is a subtle chance to brick the device if the bootloader update is aborted in the moment where the startup vector table is modified. But I think we can figure this out by using the ARM hard fault handler. The idea is that if the bootloader update failed and the hard fault handler is triggered, it can check if the bootloader is valid and if that fails it will jump into the application, where the bootloader update can be tried again. We've made some tests with the hard fault handler which look promising, but it will take some time until it's ready for public release.

Hi

I flashed the new firmware today, but I'm still seeing logs like:

0x000D6FFFFE540E7C error APSDE-DATA.confirm: 0xA7 on task

Is this related?

0xA7 indicates that an APS ACK has not been provided where it should have been. I guess that could have a variety of reasons.

Hi @manup , good to hear from you again. Do you also have some time to further discuss the remaining routing issues? And hopefully solve them?

Hi again,

I'm still having issues where some lights don't react to commands (on/off/dim), and I really have no idea why. I thought that it had something to do with this issue, but now I'm unsure if it does.

I still get a lot of errors like the one I posted 4 days ago.

code count
0xA7 29
0xAE 31
0xD0 1
0xE1 1
0xE9 14
total 76

I have 26 devices posting these errors TODAY; it seems to have gotten a little worse with 0x26580700.

Can someone tell me if this is related to this routing issue, or should I open a separate issue for my problem?

Note that it seems to happen only when sending e.g. "on" to ~20 lights "at the same time".
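
A count like the one above can be produced from the deCONZ log with a few lines of Python. This is only a sketch: the regular expression assumes every relevant line has exactly the shape of the line quoted earlier (`<address> error APSDE-DATA.confirm: <code> on task`), and the function name is made up for the example.

```python
import re
from collections import Counter

# Matches lines shaped like:
#   0x000D6FFFFE540E7C error APSDE-DATA.confirm: 0xA7 on task
CONFIRM_RE = re.compile(
    r"(0x[0-9A-Fa-f]{16}) error APSDE-DATA\.confirm: (0x[0-9A-Fa-f]{2})"
)


def tally_confirm_errors(lines):
    """Return (Counter of status codes, set of device addresses), both as ints."""
    by_code = Counter()
    devices = set()
    for line in lines:
        match = CONFIRM_RE.search(line)
        if match:
            devices.add(int(match.group(1), 16))
            by_code[int(match.group(2), 16)] += 1
    return by_code, devices
```

Feeding it the log lines of one day gives the per-code totals and the number of distinct devices, which makes it easier to compare firmware versions or to see whether the errors cluster around a bulk "on" command.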

Hi @djwlindenaar yes absolutely, I need to catch up the comments in this issue during the week. If my memory serves me well, there were already more ideas to improve the remaining issues.

@manup , sure, I think the key issue is still that deCONZ is still stubborn in retaining a route that does not work.
I believe that messages from both the misbehaving and the affected light keep coming in (via some other route) and that is misinterpreted by the firmware as if the non-working route is still working. Also deCONZ tends to give up on APS layer requests quite quickly if no ACK is received. I think it should be more persistent and/or start a route discovery as part of the retry process.

Finally, I am now convinced that source routing would eliminate the main problem of misbehaving routers. So if you could implement that, we would be in a much better place. I'm ready to help/test/debug...

Hi Manup, I may be hijacking this thread, so if so please disregard. I have a Conbee II on Home Assistant and it was working great up until a month or so ago. At that point all my Xiaomi sensors became unreliable, breaking automations. I have some dresden FLS-PP controllers that I've had for a couple of years. These are connected to LED strips and used to sporadically drop off the network, forcing a reboot to get them back on. I finally removed them all from my Conbee network and immediately the whole network was stable. I left it a few days, then re-introduced one, which worked for a few hours; then overnight my Zigbee network failed and I couldn't get all my switches/sensors back online until I turned off the Dresden controller. No idea why, but for me, using the dresden controllers now breaks my Zigbee network. I'm only a novice on this so not sure if this is helpful/relevant, but I was searching for any comments around this and happened upon this thread, so I thought I'd throw my experience into the mix just in case. For now I'm just removing them from my network - not worth the headache they were giving me!
Cheers

I have decided to move to zigbee2mqtt due to the routing issues (most annoying is the need to re-pair Ikea/Xiaomi occasionally). I will post my findings here in some weeks...

After flashing firmware 0x26350500 on my Conbee I (and updating deCONZ to 2.05.76) last monday, unfortunately a GU10 went offline.

image

In VNC it is a little bit grey:

image

Phoscon:

image

Is it necessary to power cycle all the lights before they take advantage of the routing fix in the new firmware version?

@JBS5 , I don't think it is necessary to power cycle lights after the firmware upgrade (unless they were down before the upgrade, they don't magically revive...).
It's probably due to the fact that, although improved, we're not completely clear of routing issues; see my previous post(s).

@djwlindenaar Thanks. I understand.

For what it's worth, because it is kind of the same behaviour as with the previous firmware:

After the GU10 was marked as unavailable, it was still responding to group commands. After a few days, 2 battery powered devices (Aqara motion sensor and Xiaomi/Honeywell smoke detector) went offline as well. The GU10 was still responding to group commands.

After power cycling the GU10, the 2 battery powered devices also came back online right away.
So after the GU10 went offline, it took a few days until a battery powered device using that specific GU10 as router went offline as well.

@JBS5 , yes, that sounds like typical behaviour. Actually there's nothing wrong with that GU10 light. It's just that deCONZ lost its way to communicate directly with it. After some time this also causes its end devices to go offline as they also don't get any feedback from deCONZ. Finally as the GU10 gets sufficiently starved of network packets it goes offline for real.

Probably in the phase when it is still responding to group commands, you will be able to power cycle another router, which is apparently fine, and the GU10 will recover. That other router is actually not playing nice and deCONZ is not responding well to the situation.
So don't be angry at the GU10 ;)

@manup , sure, I think the key issue is still that deCONZ is still stubborn in retaining a route that does not work.

Hmm, the routes should be degraded and dropped after multiple failed transmissions. The latest firmware will degrade a route every time an error is reported in the APS-DATA.confirm (like a routing error or no APS-ACK received).

I believe that messages from both the misbehaving and the affected light keep coming in (via some other route) and that is misinterpreted by the firmware as if the non-working route is still working.

That's a very good hint, I'll check that. It would explain why routes won't die. I think the only sane way to evaluate a route is based on successful transmissions.
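
To illustrate the idea (this is not the firmware's actual logic, which is not public in this thread), a route could be scored purely on the outcome of transmissions sent over it. The class name, the failure threshold and the method names below are all hypothetical.

```python
class RouteHealth:
    """Score one route using only the results of frames sent over it."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0  # consecutive failed transmissions

    def on_confirm(self, success):
        """Update from the confirm of a transmission that used this route."""
        if success:
            self.failures = 0
        else:
            self.failures += 1

    def on_incoming_frame(self):
        """Deliberately does nothing: a frame received from the node may
        have arrived via a different path, so it proves nothing about
        whether this route still works in the outgoing direction."""

    @property
    def usable(self):
        """False once the route has failed max_failures times in a row."""
        return self.failures < self.max_failures
```

The design choice is in `on_incoming_frame`: if incoming traffic refreshed the route, a node that keeps reporting over another path would keep a dead route alive, which matches the symptom described above.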

Also deCONZ tends to give up on APS layer requests quite quickly if no ACK is received. I think it should be more persistent and/or start a route discovery as part of the retry process.

deCONZ doesn't do any route discovery; this is all handled by the Zigbee stack. We can extend the REST-API to make retries for some commands (like control commands), but this is a tough one.

The stack itself could be configured to make multiple APS retries until give up. But this can mess up a lot of things and block the queue for a long time. I think it's best to consider that more fine grained in the REST-API.

Finally, I am now convinced that source routing would eliminate the main problem of misbehaving routers. So if you could implement that, we would be in a much better place. I'm ready to help/test/debug...

In theory source routing is the holy grail to fix it all :)

But in the real world many gateways don't use it (do any?), which means that most products either don't support source routing or weren't tested with it. IMHO chances to get it working in mixed networks are very low. But it's been a while since I've compared various gateways on that level. It would be interesting to have a current overview of which gateways use which routing approach nowadays.

Happy to help testing/sniffing Hue bridge (gen 2 and gen 1), innr gateway, IKEA hub, and OSRAM Lightly Home gateway, but I need some input on what test setup to use (how many routers, end devices, distances between them, ...) and what to look for.

Z-Stack FW is one example of FW that offers/uses source routing and recommends it for larger networks. But I've also seen some comments that it does not work really well on the weaker CC2531.

https://github.com/Koenkk/Z-Stack-firmware/tree/master/coordinator

Also deCONZ tends to give up on APS layer requests quite quickly if no ACK is received. I think it should be more persistent and/or start a route discovery as part of the retry process.

deCONZ doesn't do any route discovery; this is all handled by the Zigbee stack. We can extend the REST-API to make retries for some commands (like control commands), but this is a tough one.

The stack itself could be configured to make multiple APS retries until give up. But this can mess up a lot of things and block the queue for a long time. I think it's best to consider that more fine grained in the REST-API.

The stack does do some retries, I remember it tries three times, if I'm not mistaken. It seems to me that you could add a behavior in that piece of code to invalidate the associated route and retry once or twice again. Or maybe you can have a callback inside the stack to implement that behavior with. That should elicit a route discovery, right?

Finally, I am now convinced that source routing would eliminate the main problem of misbehaving routers. So if you could implement that, we would be in a much better place. I'm ready to help/test/debug...

In theory source routing is the holy grail to fix it all :)

But in the real world many gateways don't use it (do any?), which means that most products either don't support source routing or weren't tested with it. IMHO chances to get it working in mixed networks are very low. But it's been a while since I've compared various gateways on that level. It would be interesting to have a current overview of which gateways use which routing approach nowadays.

To be honest, I'm not very worried about this issue. The behavior of the routers in case of source routing is extremely simple. Especially if you compare it with the behavior required to do routing themselves. Basically the only thing the NWK layer in a router has to do, is to check if source routing is enabled, read the next hop from the packet and hand it back to the MAC layer. No tables, no memory, nothing to do wrong. I'm not sure if the basic behavior is checked for zigbee certification, but for sure it is much simpler and there's much less chance of bugs than normal routing.
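
To show how little a relay has to do, here is a sketch of that forwarding step. It is a simplified model, not the on-air frame layout: the packet is a plain dictionary, the relay list is kept in travel order, and the field names are invented for the example.

```python
def next_hop(packet, my_addr):
    """Return the address a relay should forward a source-routed packet to.

    packet is a dict with:
      "dst"         - final destination address
      "relays"      - relay addresses in travel order (simplified model)
      "relay_index" - position in "relays" of the relay handling it now
    The index is advanced in place so the next relay sees its own position.
    """
    relays = packet.get("relays")
    if not relays:
        raise ValueError("packet is not source routed")
    index = packet["relay_index"]
    if relays[index] != my_addr:
        raise ValueError("packet is not addressed to this relay")
    packet["relay_index"] = index + 1
    if index + 1 < len(relays):
        return relays[index + 1]  # hand over to the next relay
    return packet["dst"]          # last relay: deliver to the destination
```

There is no routing table lookup and no per-destination state in the relay; everything it needs travels in the packet.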

My expectation is that few coordinators implement it, because the complexity is in the coordinator firmware. Also very few (consumer) zigbee implementations are actually focused on the larger networks that would benefit from this routing mode.

Finally I'd argue that this is the best routing mode for mixed networks, because normal routing depends on the individual implementations of the routers and especially the poorly specified link quality indications. Since source routing is so simple (=low chance of bad implementations), I would trust that before the normal routing behavior.

So... having said this, I'm ready to test it for you if you could provide a firmware. I'm on a raspbee, my sniffer is ready to go.

Finally I'd argue that this is the best routing mode for mixed networks, because normal routing depends on the individual implementations of the routers and especially the poorly specified link quality indications. Since source routing is so simple (=low chance of bad implementations), I would trust that before the normal routing behavior.

KISS - makes a lot of sense to me.

The stack does do some retries, I remember it tries three times, if I'm not mistaken. It seems to me that you could add a behavior in that piece of code to invalidate the associated route and retry once or twice again. Or maybe you can have a callback inside the stack to implement that behavior with. That should elicit a route discovery, right?

If I understand the code correctly that's already happening; the problem is that routes seem to be kept alive due to other factors (under investigation).

To be honest, I'm not very worried about this issue. The behavior of the routers in case of source routing is extremely simple. Especially if you compare it with the behavior required to do routing themselves. Basically the only thing the NWK layer in a router has to do, is to check if source routing is enabled, read the next hop from the packet and hand it back to the MAC layer. No tables, no memory, nothing to do wrong. I'm not sure if the basic behavior is checked for zigbee certification, but for sure it is much simpler and there's much less chance of bugs than normal routing.

I can only speak for the BitCloud based products I brought through certification (FLS ballasts). In the certification process they were never tested for source routing. I'm not sure if it was even enabled in the stack at compile time. Note that most stacks are configured to be compiled with the minimum required features to save space.

My personal experience with rarely used/tested features, no matter how simple they are in theory, is that they always have bugs. For example for the FLS we had to fix tons of bugs in the example BitCloud stack application. I know that Philips also made heaps of improvements in the BitCloud stack for some of their products.

My expectation is that few coordinators implement it, because the complexity is in the coordinator firmware. Also very few (consumer) zigbee implementations are actually focused on the larger networks that would benefit from this routing mode.

The complexity needs to be implemented in the deCONZ REST-API plugin or a new router plugin, since the firmware has neither enough insight into the whole network topology nor the flash space. For that purpose the deCONZ::ApsDataRequest could be extended with a source route option in the form of a std::vector<quint16> holding the nwk addresses of a route.

I don't have a good overview of the complexity this brings, which is currently handled by the mesh routing. The following tasks need to be considered at minimum and at scale:

  • Recursive query of network topology (Mgmt_Lqi_req). That's the easy one; this already works for mesh routing and could be adapted to source routes.
  • Nodes are powered off. Here some logic needs to select alternate routes, which may also not work if their links are broken too.
  • Nodes are powered on again.
  • Nodes changing the nwk address.
  • End-devices changing parents.
  • End-devices joining, in multi-hop networks we don't know the parent without querying the network.
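The "select alternate routes" task in the list above can be sketched as a breadth-first search over a neighbor map, such as one assembled from Mgmt_Lqi responses. This is only an illustration: `find_route`, the link map layout and the node names in the usage note are hypothetical, and link quality is ignored.

```python
from collections import deque
from typing import Dict, FrozenSet, Hashable, Iterable, List, Optional


def find_route(
    links: Dict[Hashable, Iterable[Hashable]],
    src: Hashable,
    dst: Hashable,
    offline: FrozenSet[Hashable] = frozenset(),
) -> Optional[List[Hashable]]:
    """Fewest-hop path [src, ..., dst] that avoids offline nodes, or None."""
    if src in offline or dst in offline:
        return None
    previous = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in previous and neighbor not in offline:
                previous[neighbor] = node
                queue.append(neighbor)
    return None
```

With `links = {"gw": ["zolder", "hall"], "zolder": ["garage"], "hall": ["garage"]}`, marking "zolder" as offline moves the route to the garage over "hall". Marking both as offline returns None, which is the "alternate routes may also not work" case.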

What we don't know yet:

  • Do all routers support source routing?
  • Are there any commercial gateways which use source routing? Here we need to sniff traffic and look at the NWK headers which hold the source routes and have respective flags set.
  • For the devices which support source routing, how well does it work?
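For the sniffing part, the flag to look for is the source route bit in the NWK frame control field. A small decoder sketch follows; the bit layout is my reading of the NWK header, and it agrees with the Wireshark decode of 0x1a09 from the Network Status frame quoted earlier in this thread (Command, Discover Route: Suppress, Security, Destination, Extended Source). The function names are made up for this example.

```python
from typing import Dict, Iterable, Union

SOURCE_ROUTE_BIT = 0x0400  # bit 10 of the NWK frame control field


def decode_nwk_frame_control(fc: int) -> Dict[str, Union[int, bool]]:
    """Split a 16-bit NWK frame control value into its sub-fields."""
    return {
        "frame_type": fc & 0x3,              # 0 = data, 1 = command
        "protocol_version": (fc >> 2) & 0xF,
        "discover_route": (fc >> 6) & 0x3,   # 0 = suppress
        "multicast": bool(fc & 0x0100),
        "security": bool(fc & 0x0200),
        "source_route": bool(fc & SOURCE_ROUTE_BIT),
        "destination_ieee": bool(fc & 0x0800),
        "extended_source": bool(fc & 0x1000),
    }


def count_source_routed(frame_controls: Iterable[int]) -> int:
    """Count frames in a capture whose source route flag is set."""
    return sum(1 for fc in frame_controls if fc & SOURCE_ROUTE_BIT)
```

Running `count_source_routed` over the frame control values exported from a capture of a commercial gateway would answer the second question in the list above.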

I can only speak for the BitCloud based products I brought through certification (FLS ballasts). In the certification process they were never tested for source routing. I'm not sure if it was even enabled in the stack at compile time. Note that most stacks are configured to be compiled with the minimum required features to save space.

I'm not familiar with how this works, but aren't the stacks somehow certified as well?

My personal experience with rarely used/tested features, no matter how simple they are in theory, is that they always have bugs. For example for the FLS we had to fix tons of bugs in the example BitCloud stack application. I know that Philips also made heaps of improvements in the BitCloud stack for some of their products.

My expectation is that few coordinators implement it, because the complexity is in the coordinator firmware. Also very few (consumer) zigbee implementations are actually focused on the larger networks that would benefit from this routing mode.

The complexity needs to be implemented in the deCONZ REST-API plugin or a new router plugin, since the firmware has neither enough insight into the whole network topology nor the flash space. For that purpose the deCONZ::ApsDataRequest could be extended with a source route option in the form of a std::vector<quint16> holding the nwk addresses of a route.

I don't understand this statement. Source routing is completely handled at NWK level in the stack. The bitcloud stack should have some compile time (or less likely runtime) flag to enable source routing. There's a source route table which is kept up-to-date by the stack, based on the Route Record messages which are already in our network due to the MTORR sent by the coordinator every few minutes.

  • Basically for each device in the network the route to the coordinator is known through MTORRs
  • The coordinator knows each (complete) route to each device through the Route Record messages. No analysis is needed, just a copy-paste of the RR message into the source route table

See the zigbee specification chapter 3.6.3.5.4 and chapter 3.6.3.3
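To show what "copy-paste" means here, a rough sketch of a source route table fed by Route Record commands. It assumes the relay list in a Route Record starts with the relay closest to the originating device, which is the same order the source route subframe wants for the reverse direction. The function names and the dict-based table are illustrative only, not BitCloud internals.

```python
from typing import Dict, List, Optional, Tuple

SourceRouteTable = Dict[int, List[int]]


def on_route_record(table: SourceRouteTable, originator: int, relay_list: List[int]) -> None:
    """Store the relay list of a received Route Record unchanged."""
    table[originator] = list(relay_list)


def plan_unicast(table: SourceRouteTable, destination: int) -> Optional[Tuple[int, Optional[dict]]]:
    """Return (first hop, source route subframe), or None if no route is recorded."""
    relays = table.get(destination)
    if relays is None:
        return None                # unknown: fall back to mesh route discovery
    if not relays:
        return destination, None   # direct neighbor, no subframe needed
    subframe = {
        "relay_count": len(relays),
        "relay_index": len(relays) - 1,
        "relay_list": relays,
    }
    # The last recorded relay is the one closest to the coordinator.
    return relays[-1], subframe
```

With the addresses from the capture earlier in this thread, a Route Record from Garage (0x9e0c) listing Zolder (0xd6b7) as its only relay would make the coordinator send to 0xd6b7 with relay index 0.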

I don't have a good overview of the complexity this brings, which is currently handled by the mesh routing. The following tasks need to be considered at minimum and at scale:

* Recursive query of network topology (Mgmt_Lqi_req). That's the easy one; this already works for mesh routing and could be adapted to source routes.

* Nodes are powered off. Here some logic needs to select alternate routes, which may also not work if their links are broken too.

* Nodes are powered on again.

* Nodes changing the nwk address.

* End-devices changing parents.

* End-devices joining, in multi-hop networks we don't know the parent without querying the network.

I believe these are all handled by the NWK layer inside the bitcloud stack. It should be as simple as adjusting some compile time flags (similar to enabling the MTOR behavior)

What we don't know yet:

* Do all routers support source routing?

I'd expect so, since this is part of the basic zigbee specification of the NWK layer. I can't be certain, but I did not see any remark that supporting source routing is optional. It's similar to MTOR behavior, which we already use in our network.

* Are there any commercial gateways which use source routing? Here we need to sniff traffic and look at the NWK headers which hold the source routes and have respective flags set.

Don't know this one, also don't know what that will tell us?

* For the devices which support source routing, how well does it work?

If you could make a build of the firmware with source routing enabled, we will learn quickly. It should just be a few compile time flags to enable it in bitcloud.

Also Zigbee2MQTT has it enabled for some coordinator firmwares and I didn't yet (after a quick google) find any issues with specific routers not working well with source routing enabled.

I'm not familiar with how this works, but aren't the stacks somehow certified as well?

Yes, but during certification specific configurations are enabled which may not be enabled in the products later on.

I don't understand this statement. Source routing is completely handled at NWK level in the stack. The bitcloud stack should have some compile time (or less likely runtime) flag to enable source routing. There's a source route table which is kept up-to-date by the stack, based on the Route Record messages which are already in our network due to the MTORR sent by the coordinator every few minutes.

Basically for each device in the network the route to the coordinator is known through MTORRs
The coordinator knows each (complete) route to each device through the Route Record messages. No analysis is needed, just a copy-paste of the RR message into the source route table

See the zigbee specification chapter 3.6.3.5.4 and chapter 3.6.3.3

Ah you're right, silly me, I somehow confused it with general node discovery (facepalm).
The stack can indeed figure out a lot due to route record commands, but I already ran into a case where none were sent, see below.

Today I've tried many configurations. It seems that source routing kind of works, but only when mesh routing is disabled and only for devices which send route record commands beforehand.

My Philips Hue motion sensor causes trouble and sends broadcast NWK address requests to get the gateway NWK address. In this case no route record frames were sent a priori (I think because of the broadcast). Since mesh routing was disabled the gateway didn't start a route discovery and so communication to the sensor wasn't established.

When mesh routing is enabled next to source routing it seems that source routing isn't used anymore, although both are enabled.

I need to do more tests, I think mesh route discovery should be prevented when there are already route record entries in the route cache. The code is quite complex, I need to dig deeper here.

deCONZ_ConBeeII_0x26600700_many2one.bin.zip

Here is the firmware which has both mesh and source routing enabled (route record table size of 32). You may try it but as mentioned in my smaller network source routing wasn't triggered in that configuration.

I'm on raspbee I, so can't test that one. Could you create one for that as well?
I'm wondering whether we could work around this issue with a kind of static route which we can upload to the firmware at startup, or maybe indeed based on some other information.
Also, did you re-pair this device after the change to source routing? Wondering whether there is some interaction needed for an end device to recognise that MTOR and source routing are in use. Of course they don't receive the broadcast MTOR messages when their receiver is off.
BTW I did see e.g. my Xiaomi end devices send the route record in my sniffing...

I've let the above firmware run overnight. It seems source routing is also used with mesh routing enabled, but very rarely. Of ca. 2 million packets, fewer than 5 used a source route.

I'm on raspbee I, so can't test that one. Could you create one for that as well?

Not sure, I need to check the other stack configuration used by ConBee I and RaspBee I if it's possible.

I'm wondering whether we could work around this issue with a kind of static route which we can upload to the firmware at startup, or maybe indeed based on some other information.

Yes, I think it should be possible to inject routes into the route cache somehow. But then we are back to my concerns mentioned above about maintaining routes. Anyway, for testing purposes and fixed networks it would provide a useful way to configure routing.

Also, did you re-pair this device after the change to source routing? Wondering whether there is some interaction needed for an end device to recognise that MTOR and source routing are in use.

No, I just updated the firmware, no re-pairing; source routes were established shortly after receiving the route record commands.

BTW I did see e.g. my Xiaomi end devices send the route record in my sniffing...

IKEA seems to fit in here too. I'd like to do more tests with Philips and other brand devices to get more details on how various devices can be used with source routing.

I could chime in as well for testing if we can make it work for Raspbee.

I've let the above firmware run overnight. It seems source routing is also used with mesh routing enabled, but very rarely. Of ca. 2 million packets, fewer than 5 used a source route.

Interesting! Would be nice to get an insight into the decision making logic when a source route is used and whether it can be influenced/tuned somehow.

I'm on raspbee I, so can't test that one. Could you create one for that as well?

Not sure, I need to check the other stack configuration used by ConBee I and RaspBee I if it's possible.

Would be great...

I'm wondering whether we could work around this issue with a kind of static route which we can upload to the firmware at startup, or maybe indeed based on some other information.

Yes, I think it should be possible to inject routes into the route cache somehow. But then we are back to my concerns mentioned above about maintaining routes. Anyway, for testing purposes and fixed networks it would provide a useful way to configure routing.

Agreed

Also, did you re-pair this device after the change to source routing? Wondering whether there is some interaction needed for an end device to recognise that MTOR and source routing are in use.

No, I just updated the firmware, no re-pairing; source routes were established shortly after receiving the route record commands.

Wondering if it would help for the Philips device...

Hello guys, I've been running the Z2M Source Routing firmware for 24 hours (it only allows 5 direct connections). I have ~35 devices. It works fine, but I've noticed that the Hue devices need a power cycle to reconnect after I moved from the default FW to the SR firmware. IKEA and Xiaomi reconnected automatically. Maybe this can explain parts of the Hue behaviour you're seeing?

Hi,
I retested my TRADFRI bulb E14 WS opal 400lm with IKEA firmware 2.3.050, the latest deCONZ beta and the latest firmware on my ConBee I. I can confirm that the bulbs and communication have improved. Scene recalling, switching on/off, and fading in and out now work better, but there is one issue left.

The issues documented in #2518 have been mostly resolved. It seems like group commands (from the Phoscon GUI) work even better. #2518 closed

I have also noticed that changing CT sometimes needs to be triggered more than once to succeed if changed by hand rather than via a scene.

A new issue, #2892, has appeared: if a scene is recalled within a group and turns on all of its bulbs, and later another scene in that group that only turns on a few bulbs is recalled, the "unwanted" bulbs that should turn off stay on.
Furthermore, if you turn the group off, only the bulbs of the currently defined scene turn off and the others (from the previous scene in the group) stay on. They need several off commands before going off.
The issue is described here, but it is certainly in the API, as it makes no difference whether Phoscon or the API itself is used. https://github.com/dresden-elektronik/phoscon-app-beta/issues/52

additional note: using latest ikea fw 2.3.050 release at 20.5.2020

Release Version: 1.12.1
20th May 2020
New features and changes in Accessories:
WS 1.0 (E14, E27, GU10) V-2.3.050.
Fixed On Off State after OTA Upgrade.

Thanks for your continuous improvement and work spent.

@manup , any news for the source route firmware version of conbee I ? I'm looking forward to test that :)

My first attempt to enable it ended in a horrible compile time error mess :/ Haven't figured out yet what's the cause and hope to get it to compile somehow.

Hmmmm does not sound great... @manup , I would love to help, but I don't have access to that code.

@manup , any progress?

@manup , pssst! 😁

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Still happening. @manup any progress and/or update on this issue?

@djwlindenaar @JBS5 I've asked @manup for an update.

Yes, still an issue - even though much less than before.

What is the latest on the issue?

I have about 20 white spectrum GU10 (first gen) connected to a Hue bridge and as part of my zigbee network rationalisation I wanted to migrate all my devices to Deconz.

Hue is quite stable but the lights still go offline from time to time (e.g. once a week one light goes offline). I just need to restart the bulbs to resuscitate them.

Is Deconz more or less stable?

No, deCONZ is not more stable. I also have about 40 IKEA lamps and sometimes some go offline.

@salopette I do not have that experience. Would you mind opening a separate issue? Any logs?

@Mimiix This issue is about lights that go offline. Why open a new issue?

Same here. Lights still go offline once in a while. But it has much improved over time. When I started using deCONZ in 2018 I had to powercycle my ikea lights pretty much daily. I most often experience this issue with a FLOALT Panel but this is also the light furthest away from my Conbee Stick.

@JBS5 Because we don't know anything about this user's setup. No version, no logs, no information on the setup. We are in no position to determine if his issue is related to this.

Adding comments without any information is not helping in this case. That's why I want separate issues, so we have a better overview.

I am once again asking @manup in private to look into this and prioritize it.

@djwlindenaar has been digging deep into it. I doubt that inspecting another user's setup can do anything about an issue which seems to exist in general with IKEA lights. I would rather concentrate on the existing investigation than start another.

General announcement:

I just spoke with @manup in private on this. He told me the following:

```
That the firmware will get an update in regards to the routing issues, and the new network should hopefully help to recreate the issues (which wasn't possible before).
My plan is that this next firmware is available within the next 2–3 weeks.

Some changes are already done but are not public yet.
```

I will keep you all posted.

3 weeks later, any update on this?

@JBS5 Not 3 weeks yet. Waited for the stale bot here ;)

PM'd @manup again this morning. Got the following response:

Have been crawling in the firmware the last couple of days to investigate the reboot issues and found some nasty bugs.
It will also contain a few routing changes, hope to get it out with the coming release.

The debug enhancements will follow after the next stable release.

I asked for an ETA and he is on it at this moment.

Well, you made a promise to the community, so it's normal that they hold you to it: 21 days / 7 days = 3 weeks.
It's lame to hide behind the stalebot when this issue has been active since February 2019. If you want to be the spokesperson for Dresden, act like it, or just let @manup handle this :)

So what is the ETA now that the 2-3 weeks have passed?

@lubbertkramer Please be reasonable here, the stale bot was a joke. The only promise @Mimiix made previously is to share updates once some are available, so it's perfectly valid to ask about the current status. Now for the ETA, none has been promised and I understand none will be, as this is complex stuff. Knowing manup, when he talks about nasty things it's unfortunately nothing which can be sorted out within a day; such bugs may only show up after considerable time or bring unforeseen side effects.

So in that sense, please be patient and let the guys work on it (it's still vacation time by the way). As for the next release, that's in 1.5 weeks; that is the time to raise your hand and ask for some news.

Reboot issues are not related to this issue I guess?
What about the routing changes and the mentioned debug enhancements? Are those debug enhancement related to this issue?

And are those routing changes based on @djwlindenaar's findings, or is @manup already able to be more specific?

I think it's more than reasonable to hold people to past communication or promises, without “jokes”, when the announced 2-3 weeks have been followed by 3 weeks without follow-up. Users who have investigated a lot and replied here have already moved on because of the lack of follow-up/communication.

So, back to the issue: what is the latest information?

I have no problem leaving this issue behind me when you act like this. I am in no way a spokesperson; I am sticking my neck out here to get things sorted, as I have a direct connection. I use the stale bot to keep track of issues, and I do get the notifications. I expect manup to come back to me, and if the time frame has passed, the stale bot helps remind me.

So no, it's not "lame" to hide behind it. If you did what I did for the community, you would know better than that. To add to that, it's not as if I haven't had other things to do in life.

_Be the change you want to see in this world_

Nobody is asking or forcing you, as a user, to be the middle man, so once again: instead of flaming and changing flower power, leave this issue as spokesperson or get back to the issue and the latest intel.

The 2-3 weeks from your general announcement, posted more than 3 weeks ago, have passed, so what is the follow-up?

All I had was what I mentioned. Please let me advise you to keep your emotions under control. I understand your anger, but being disrespectful is not helping. I do not accept any flaming here. Keep it on topic and civil. I am just explaining what my arguments are and why I do things the way I do.

If you want to start an issue to complain, please be my guest. Just don't do it in this issue.

The only embarrassment in this issue is manup and Dresden obviously not being capable of fixing this issue. ConBee is a commercial product and with that come responsibilities. Which is why many of us have already left deCONZ/ConBee for a TI chip and Z2M, which has been working flawlessly ever since.

@JBS5 Not 3 weeks yet. Waited for the stale bot here ;)
Pmed @manup this morning again. Got the following response:

Have been crawling in the firmware the last couple of days to investigate the reboot issues and found some nasty bugs.
It will also contain a few routing changes, hope to get it out with the coming release.

The debug enhancements will follow after the next stable release.

I asked for a ETA and he is on it at this moment.

Well you made a promise towards the community so its normal they hold you to that promise, 21 days / 7 days = 3 weeks
then it's lame to hide behind the stalebot when this issue is already active since february 2019. When you want to be the spokeperson for Dresden act like it or let just @manup handle this :)
So what is the ETA now the 2-3 weeks have passed

I have no problem in leaving this issue behind me when you act like this. I am in no way a spokesperson, I am sticking out my neck here to get things sorted as I have a direct connection. The stale bot I use to keep track of issues and I do get the notification. I expect manup to come back to me and if the time frame passed, the stale bot helps me remind.
So no, it's not "lame" to hide behind it. If you did what I did for the community, you would know better than that. Also to add that, it's not that I haven't had other things to do in life.
_Be the change you want to see in this world_

Nobody is asking or forcing you to be the middleman as a user, so once again, instead of flaming and changing flower power, leave this issue as spokesperson or get back to the issue and the latest intel.
The 2-3 weeks that you posted as a general announcement have now passed (it has been more than 3 weeks), so what is the follow-up?

All I had was what I mentioned. Please let me advise you to keep your emotions under control. I understand your anger, but being disrespectful is not helping. I do not accept any flaming here. Keep it on topic and civil. I just explain what my arguments are and why I do stuff the way I do.

If you want to start an issue to complain, please be my guest. Just don't do it in this issue.

And again, you are not answering or updating the status, which has been asked for many times, and the only thing you are doing is continuing an off-topic discussion when you could already have given an update as asked at the end of every post. You are bragging about direct connections/invites in the development meetings, so you should be up to speed about the status of this issue and could update all the users reading here.

So once again, post an update or stop posting off-topic. I guess that is your job as community moderator, instead of keeping the off-topic discussion alive :)

General announcement:

I just spoke with @manup in private on this. He told me the following:

My plan is that this next firmware is available within the next 2–3 weeks.

Some changes are already done but are not public yet.

I will keep you all posted.

The 2-3 weeks that you posted as a general announcement have now passed (it has been more than 3 weeks), so what is the follow-up, what can we users expect?

A new release is about to come in less than a week. I suggest keeping calm and waiting to see what the new changes will bring.

I have lots of Tradfris that have been working without problems for more than a year. Recently, I started to have problems with just one of my Tradfri lights, which started dropping off and reconnecting to the mesh about every 15 minutes: 15 min reachable, 15 min not reachable. I did a lot of testing to find the problem. Guess what the problem was? It was not deCONZ but a schedule I had recently put on my WiFi repeater.

Don't be so sure the problem is always deCONZ related; sometimes it's not.

@morfei1 and @Mimiix The next release will hopefully come in around 2 weeks, judging by how DE's release schedule works (the "15th", a little before payday). And there is no word on whether a fix will be in it, with or without holidays.
As some have said (and I have many times), this is how the official support works when customers have large problems.
Off the record: in 1 - 2 weeks ZHA is getting official support for Silabs EZSP across all versions of its stack protocols, so I think a Sonoff Zigbee Bridge or an Elelabs ELU013 / ELR023 is a better place to put the money, with more RF and SoC power than the DE products and better support from the community.
But I'm waiting for deCONZ REST-API version 2 before letting all DE products RIP.

To answer your question directly: That's what @manup told me after I asked him again. I can't force anyone to do something. I'll ask again after the weekend.

This will be a general last warning on your comments. They do not show any respect and are very disparaging towards me and anyone running this community. The next off-topic comment here is getting this issue locked until there's more news. Feel free to open a second issue concerning the handling of issues.

That’s a bit of an answer, so if I read it correctly we can expect a more detailed answer later this week, because by then almost 4 weeks will have passed instead of the 2-3 that @manup had communicated through you in the general announcement for the update to be available.

@Mimiix if you feel the urge to close this issue, which has been open for almost two years with lots of information from very supportive and dedicated users but without a solution, then I’m not the person who can hold you back; that’s up to you as community spokesperson :)

@lubbertkramer I said Lock, not close.

And with that, I will lock this issue. I forwarded the issue to @manup. I will unlock it in a few days or when Manuel has replied here.

Hi again, guess it's more than time for an update on the issue.

But first things first, if anyone is to blame here it's me and only me. I understand the frustration but there is no room for being impolite here towards @Mimiix, so let's be civil guys. If it wasn't for him and all he has achieved for this community, I wouldn't have had the time to get stuff done in deCONZ core and the firmware and to work on bug fixes and improvements. It was easier in the past, but since this whole thing scaled like crazy, todo lists, support calls and emails grew like crazy as well, with the reboot loop bugs being my personal hell. Thanks to the moderators and all the awesome devs and contributors things look way better now, so deeper issues can be tackled, one at a time.

Progress update

(Preview for the next release)

I've been testing a lot with source routing and experimented with the findings in this issue and various sniffer logs. I spent a lot of time trying to get "automatic" source routing working reliably, based on incoming Route Record commands, which is basically what most implementations with source routing do, like the source routing firmware for zigbee2mqtt. It kind of works sometimes, but there seems to be a bigger dynamic in how/if a source route is chosen (and kept) based on the LQI/RSSI values which a Route Record hop sees from its neighbors. In mixed networks this is tricky because of differences in the stack and hardware. The link quality can also be quite dynamic if people move around between nodes and doors are opened and closed, which unfortunately affects routing as well.

Therefore I've experimented with the idea of configuring fixed source routes towards a destination node. In many cases only a part of the network has routing problems; here it can be beneficial to specify a fixed source route where:

  • Source routes can be optionally specified on per node basis.
  • Every hop should always be powered.
  • The hop positions through the path should be reasonable. In some cases a user might have a better plan where packets should route than what the automatic algorithm is able to figure out.

To do that the firmware protocol has been extended to optionally provide a source route per APS request. There is no limit on how many source routes can be configured, the firmware only needs to keep a few in memory for the requests and the ACK.

deCONZ automatically figures out if each hop on the route is reachable and only in that case uses the source route.
Behind the scenes source routes are configured with MAC addresses of the nodes to be resistant against NWK address changes.
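
To make the two points above a bit more concrete, here is a minimal Python sketch of the idea. It is not deCONZ or firmware code: `Node`, `FixedSourceRoute` and `resolve_route` are hypothetical names, and "reachable" is reduced to a simple flag. It only illustrates storing the hops by MAC address, resolving them to the current NWK addresses at send time, and skipping the source route when any hop is not reachable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    mac: str         # IEEE (MAC) address, stays the same across rejoins
    nwk: int         # 16-bit NWK address, may change over time
    reachable: bool  # simplified stand-in for a reachability check


@dataclass
class FixedSourceRoute:
    # Intermediate hops only, ordered from the coordinator towards the destination.
    hop_macs: List[str]
    dest_mac: str


def resolve_route(route: FixedSourceRoute, nodes: Dict[str, Node]) -> Optional[List[int]]:
    """Return the hops as current NWK addresses, or None if the route should not be used."""
    if not route.hop_macs:
        # At least one hop between source and destination is required.
        return None
    relays: List[int] = []
    for mac in route.hop_macs:
        node = nodes.get(mac)
        if node is None or not node.reachable:
            # One unusable hop is enough to fall back to normal routing.
            return None
        relays.append(node.nwk)
    return relays
```

Because the route is keyed by MAC address, a hop that rejoins with a new NWK address still resolves correctly on the next request; a hop that is powered off simply makes `resolve_route` return `None`.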

Creating a fixed source route

  • In the deCONZ GUI select all hops while holding CTRL (multi selection), beginning from the coordinator (source) up to the destination node. Important: there needs to be at least one hop between source and destination.
  • Right click on the destination node to open the context menu.
  • Select "Add source route".

This will create the new source route, visualized by the blueish line; the red end marks the destination node.

image

(In later releases it will be possible to specify alternate fallback routes but I need to figure out a good ui for that.)

To remove a source route: Select "Remove source route" in the destination node context menu.

The route will be used for the very next request:

image

image

This works pretty well and allows manual override of routing if needed, which should be handy for testing as well.
Currently this only works when configured via the GUI, but it should be available via the REST-API too later on.

Will this fix everything?

Unlikely, but it should improve routing towards destinations (the routing on the nodes themselves can't be configured). Note that fixed source routes are good for routers only, since end-devices tend to change their parent and the source route wouldn't work anymore; in this case the automatic source routing might work better.

When will it be released?

Soon, in the next 2.05.81 release, I have the following (rather light) items on the todo list for the release:

  • Store/restore source routes in database.
  • Fix handling of the NWK security counter, to fix backups between different ConBee/RaspBee devices and the "nothing works after reboot" bug.
  • Let GCFFlasher verify firmware is running after update.

So with @manup's response, an answer has been given. The issue will be unlocked. Please remain on-topic and on subject.

Thanks for the detailed response. I think a lot of us have been waiting for a response like the one above, because of all the hard work the community is doing for this issue and that you are doing as developer.

Several follow-up questions that I have:

  • When will above be available as an update in beta/stable?

  • Will there be a difference in how above will work regarding conbee 1/2 and Raspbee or will this be all the same?

  • For update 2.05.81 you mention to have rather (light) items on the todo list, when can we expect that update (maybe it's an idea to add a roadmap to this git?)

Hi @lubbertkramer

I can answer number 1: Version 2.05.81 is expected on the 15th of the month (Windows build a few days later). I announced that earlier in the Discord, but I just realized that wasn't done on GitHub. I will add that to the readme! My bad!

Edit: Opened a pull request for the readme.md file. I am not sure how the stable channel is run; that needs to be clarified a bit.

Will there be a difference in how above will work regarding conbee 1/2 and Raspbee or will this be all the same?

It should work the same, I have now ported the code over to RaspBee I / ConBee I and source routes are used there in the same manner.

I currently have a curious case where an IKEA repeater plug is alive but doesn't want to play with the rest of the network.
Commands sent to it (with and without source routes) are ignored silently.

The repeater sends its NWK Link Status frames, but reports an empty neighbor list:

image

Commands from the coordinator as well as from an OSRAM sensor of which the repeater is the parent are ignored. The sensor, like many Xiaomi devices, doesn't try to find a new parent automatically... bad situation:

image

This scenario has been running for two days now and I haven't found a way to recover the repeater. Perhaps a power-cycle of the repeater will fix the problem, but I'll keep it like this for now to see if it can be resolved somehow over the air.

Is the problem with the Tradfri USB repeater or the Tradfri plug?

In this case it was the plug, I'm not sure if it is at the latest version, need to check this later.

Is it possible to check? I have 3 Tradfri plugs running the latest FW rock solid for about 1.5 years. If I start to have trouble with them on the new beta, I will have to delay the update.

Thanks!

image

Here is the data from the basic cluster. Note the problem first occurred two days ago in my home network, which didn't have the new firmware at the time.

It looks like you are on the latest FW as my plugs. I hope it to be, because of the Ikea old FW...

The webpage with the Tradfri changelog is down, so I can't check what the latest FW fixed.

Latest firmware files are online now:

ConBee II

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_ConBeeII_0x26650700.bin.GCF

RaspBee II

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_RaspBeeII_0x26650700.bin.GCF

RaspBee I and ConBee I

http://deconz.dresden-elektronik.de/deconz-firmware/deCONZ_Rpi_0x26370500.bin.GCF

  • These have source routing enabled with max. 32 source routes, where the oldest entries are replaced automatically by newer ones. The amount will likely be increased in later releases, but should already work well.
  • If a source route and a "normal" route entry exists the source route is preferred. This is non-standard but seems to work better.
  • The firmware has general improvements for serial protocol robustness.
  • ConBee II and RaspBee II got improved frame counter handling, to mitigate problems where after a power-cycle the network could have appeared to be lost until the frame counter hits its old value again.
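
As an illustration of the first two bullets, here is a small Python sketch of a bounded source route table with a "prefer source route" lookup. It is only a model of the behaviour described above, not the firmware implementation: `SourceRouteTable` and `choose_route` are hypothetical names, and "oldest" is assumed to mean the entry that was stored or refreshed longest ago.

```python
from collections import OrderedDict
from typing import Dict, List, Optional, Tuple

MAX_SOURCE_ROUTES = 32  # limit mentioned for this firmware release


class SourceRouteTable:
    """Bounded table: destination NWK address -> list of relay NWK addresses."""

    def __init__(self, capacity: int = MAX_SOURCE_ROUTES) -> None:
        self.capacity = capacity
        self._routes: "OrderedDict[int, List[int]]" = OrderedDict()

    def store(self, dest: int, relays: List[int]) -> None:
        # Storing again moves the entry to the "newest" end (assumption).
        self._routes.pop(dest, None)
        self._routes[dest] = list(relays)
        if len(self._routes) > self.capacity:
            self._routes.popitem(last=False)  # drop the oldest entry

    def lookup(self, dest: int) -> Optional[List[int]]:
        return self._routes.get(dest)


def choose_route(
    dest: int,
    source_routes: SourceRouteTable,
    next_hops: Dict[int, int],
) -> Tuple[str, object]:
    """Prefer a known source route over a normal routing table entry."""
    relays = source_routes.lookup(dest)
    if relays is not None:
        return ("source_route", relays)
    if dest in next_hops:
        return ("next_hop", next_hops[dest])
    return ("discover", None)
```

With a capacity of 32, storing a 33rd destination silently replaces the oldest one, so a request to that replaced destination goes back to the normal route entry (or route discovery) until a new source route is learned.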

@manup did the firmware for ConBee I drop the version command id 0x0D? I'm getting no response to this command anymore

@Adminiuga @manup Things about the firmware: Please use #3260 .

How do I flash this firmware (I use marthoc's deCONZ Docker image)?

I have downloaded the file (ConBee II), but the manual update instructions don't apply to marthoc's deCONZ Docker image, and the guide for marthoc's image doesn't allow using a firmware that is not included in the image (and this file is not listed as available).

How do I flash this firmware (I use marthoc's deCONZ Docker image)?

I just plug the USB dongle into a Windows laptop and do it from there and back into my Synology NAS when it's done.

Is there a Windows utility to flash the firmware? If so, could you share the link?

https://github.com/dresden-elektronik/deconz-rest-plugin/wiki/Update-deCONZ-manually#update-in-windows

Therefore I've experimented with the idea to configure fixed source routes towards a destination node. In many cases only a part of the network has routing problems, here it can be beneficial to specify a fixed source route where:

  • Source routes can be optionally specified on per node basis.
  • Every hop should always be powered.
  • The hop positions through the path should be reasonable. In some cases a user might have a better plan where packets should route than what the automatic algorithm is able to figure out.

Will this fix everything?

Unlikely, but it should improve routing towards destinations (the routing on the nodes itself can't be configured). Note fixed source routes are good for routers only, since end-devices tend to change the parent and the source route wouldn't work anymore, in this case the automatic source routing might work better.

Just to clarify... is this new firmware going to fix (or at least improve) the random disconnection of the Ikea devices if I don't manually configure any new static source routes? Or is the only way to benefit from this update to manually set the routes of those devices that tend to lose connection?

Without manually set routes, automatic source routing takes place if devices announce them (like IKEA devices do).
So in comparison to the former firmware, source routing will be used for these devices if such a route is known.

Whether it really helps remains open; it would be good if anybody who regularly had problems with the former firmware could report if anything has changed.

In my test networks the firmware works quite well and source routes are used often as indicated by the sniffer logs. That doesn't mean that much since even without source routing my networks were pretty solid.

I have 21 lights (4 - 980lu WS, 17 - 1000lu WS), 3 plugs and 3 (5 button) round remotes from Ikea. All of them are on the latest Ikea FWs. I've never experienced instability with them for the last 1.5+ years, neither with any of the previous deCONZ versions or FW releases nor with the current one. I'm running stable .81 with the latest RaspBee FW 0x26370500, and also have not experienced any problems in the last week of using them.

I guess (maybe I'm not quite right) Ikea does not have a general problem, but one that is setup specific. There are lots of different factors that can affect their behavior and performance in deCONZ and in general.

I think most issues occurred with the Ikea GU10 Lamps. Stability had vastly improved over time, but it was a mess in late 2018 when I started. I had to powercycle one of 30 GU10 every two weeks on average for the past three months release/firmware.

Could be. The GU10 have not received FW updates like other Ikea lights, as far as I know. We recently discussed it in the Discord channels. A user running GU10 Ikea lights experienced such behaviour. No matter what we tried, nothing seemed to solve his problems. He even shared with us that he bought an Ikea Tradfri Hub and has the same bad experience, even with the Ikea Hub and GU10 lights.

So, I guess it's a matter of time until Ikea releases a new FW for this type of light to maybe solve/fix this problem....

I am wondering if @djwlindenaar has already updated to the latest deCONZ and firmware and able to share his experiences? :)

Actually not yet... I didn't find the time yet to upgrade. Also I didn't find the time to switch to z2mqtt, so the good news is that I will upgrade to the new firmware and if there's anything to report, I will.

If it really helps remains open, would be good if anybody who had regularly problems with the former firmware can report if anything has changed.

I updated to the latest firmware 2 days ago and since then one of my Xiaomi wall switches (ctrl_neutral2, 11-25-2017) has toggled on its own 3 times. I have never had such an issue before.

It is connected to the coordinator via a Tradfri bulb E27 CWS on 1.3.009.

Additionally, not strictly related to this issue, but I never managed to get an OTA update with Deconz (community container).

I have the Tradfri bulbs:

  • E27 CWS on 1.3.009
  • E14 W on 1.2.214
  • Wireless dimmer on 1.2.248
  • 17 * GU10 WS on 1.2.221

I can see the Tradfri OTA files downloaded but nothing happens. What should I do?

For reference, this is how the container starts:

docker run -d --name=deconz --net=host --restart=always -v /etc/localtime:/etc/localtime:ro -v /opt/otau:/root/otau -v /opt/deconz:/root/.local/share/dresden-elektronik/deCONZ --device=/dev/ttyACM0 -e DECONZ_WEB_PORT=801 -e DECONZ_WS_PORT=445 -e DECONZ_VNC_PORT=5930 -e DECONZ_VNC_PASSWORD=... -e DECONZ_VNC_MODE=1 -e DEBUG_INFO=1 marthoc/deconz

@ReX1983 This isn't related to this issue here afaik. Please open a separate issue. I think you should post in the Docker version repo. https://github.com/marthoc/docker-deconz

Moderation note:

Before everything gets mixed up here, i'd like to set a scope here.

Anything related to the original issue: IKEA lights occasionally losing connection is allowed in this issue.

Any other issue related to the firmware can be posted here

Actually not yet... I didn't find the time yet to upgrade. Also I didn't find the time to switch to z2mqtt, so the good news is that I will upgrade to the new firmware and if there's anything to report, I will.

@JBS5 , your trigger helped ;)

I just installed the firmware and the latest deCONZ .deb. So far I can confirm that source routing works, of course no conclusions on the stability yet. I'll keep you posted

ZigBee Network Layer Data, Dst: 0x0ea4, Src: DeConz
    Frame Control Field: 0x0608, Frame Type: Data, Discover Route: Suppress, Security, Source Route Data
    Destination: 0x0ea4[0x0ea4]
    Source: 0x0000[DeConz]
    Radius: 10
    Sequence Number: 190
    [Extended Source: dresden-_ff:ff:00:c4:9a (00:21:2e:ff:ff:00:c4:9a)]
    [Origin: 366]
    Source Route, Length: 2
        Relay Count: 2
        Relay Index: 1
        Relay 1: 0xc803[0xc803]
        Relay 2: 0xf1e5[0xf1e5]
    ZigBee Security Header
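
The fields in the dump above can be reproduced with a few lines of Python. This is only a sketch: the bit positions follow the Zigbee NWK frame control layout as I understand it, and the example bytes are reconstructed from the Wireshark fields shown above (the raw capture is not part of this thread), so treat both as assumptions.

```python
import struct

def decode_frame_control(fc):
    """Split a 16-bit Zigbee NWK frame control value into its bit fields."""
    return {
        "frame_type": fc & 0x3,               # 0 = data, 1 = command
        "protocol_version": (fc >> 2) & 0xF,
        "discover_route": (fc >> 6) & 0x3,    # 0 = suppress
        "multicast": bool(fc & (1 << 8)),
        "security": bool(fc & (1 << 9)),
        "source_route": bool(fc & (1 << 10)),
        "dest_ieee": bool(fc & (1 << 11)),
        "src_ieee": bool(fc & (1 << 12)),
    }

def parse_source_route(subframe):
    """Parse a source route subframe: count, index, then 16-bit relays."""
    relay_count, relay_index = subframe[0], subframe[1]
    # Relay addresses are little-endian 16-bit short addresses.
    relays = list(struct.unpack_from("<%dH" % relay_count, subframe, 2))
    return relay_count, relay_index, relays

# Assumed on-air bytes for "Relay Count: 2, Relay Index: 1, 0xc803, 0xf1e5".
example_subframe = bytes([0x02, 0x01, 0x03, 0xC8, 0xE5, 0xF1])
fields = decode_frame_control(0x0608)
count, index, relays = parse_source_route(example_subframe)
```

For 0x0608 this yields a data frame with route discovery suppressed and the security and source route bits set, which matches what Wireshark prints in the dump.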

@ReX1983 Look at how the OTA plugin works and how it updates your devices: https://github.com/dresden-elektronik/deconz-ota-plugin
Sorry, but this issue is about an IKEA device communication problem.

@morfei1 @peer69
I can confirm that it's not only caused by the GU10 light bulbs. I had the routing and lost connection issues for a long, long time with no GU10s in my system.

I had some initial problems after the .82 release with the new firmware on my ConBee 1.
But now, after the network mesh settled down and a few power cycles, it's been rock solid for the last couple of days.

Dumb as I am, I decided to update to the latest firmware (OTAU) on all my bulbs to have a better baseline to start from if there are future issues. Wish me luck. Will keep you guys posted on the progress.

@tubalainen are you on HA add-on?

Nope, I use HA Core on Debian in Docker, so I also use the standalone Docker package https://github.com/marthoc/docker-deconz. They both use the .deb file provided by dresden elektronik, so they should work identically. I am not sure if the OTAU plugin is included in the deCONZ HA add-on...

just chipping in with my observations so far;

I only have issues with my Ikea GU10 bulbs (23 of them). I use Home Assistant, which connects to deCONZ (Docker) via the REST API, and I can see HA firing the events to deCONZ successfully.

When I use HA to turn on 17 bulbs one by one (right after each other), maybe 3 out of the 17 don't turn on, but HA sees them as on.
BUT if I make a group in deCONZ, put all the lights in that group, and tell HA to turn on that group, there haven't been any issues yet; all lights turn on.

Using firmware 0x26650700 I haven't seen any improvement (other than that I had to re-pair those 17 bulbs for some reason, and even after re-pairing I had the same issues).

Is it a limitation of the REST API, maybe?

@MartinTerp, if you send a single command to each device in turn rather than one command to a group, it will fail. Maybe not every time, but most of the time some random device that is supposed to do something will not do it.

My workaround for this is a 5 sec delay between commands. For example:
If I want to switch my bedroom lights at 23:00 to bri: 127 and ct: 454, and also those in my hallway, I add a 5 second delay for one of the rooms. If I don't add the delay, it will randomly fail to complete the full command for one of the rooms or maybe for both of them. I experimented a lot with this. It's something similar to the Ikea bug where a light can't handle switching color and brightness at the same time.

I don't know what exactly this is: a deCONZ bug, a Zigbee limitation in general or an Ikea FW bug. Maybe it's my lack of knowledge of how Zigbee works, but the 5 sec delay always works for me.
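
The delay workaround described above could be sketched roughly like this. `send_paced` and its parameters are hypothetical names for illustration, and the 5 second default is just the value reported here, not a documented deCONZ requirement.

```python
import time

def send_paced(commands, send, delay_s=5.0, sleep=time.sleep):
    """Call send(cmd) for each command, pausing delay_s between calls.

    `send` is whatever function actually delivers one command (for
    example a REST call); `sleep` is injectable so the pacing can be
    tested without really waiting.
    """
    results = []
    for i, cmd in enumerate(commands):
        if i > 0:
            sleep(delay_s)  # no pause before the first command
        results.append(send(cmd))
    return results
```

Passing the sender in as a function keeps the pacing logic independent of how the command reaches deCONZ (Home Assistant service call, REST request, and so on).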

As a general rule I always say: send as few commands as you can at a time, or use a delay. That way you keep the channel clean.

If you need to switch many groups/lights at a time, as you've noticed it's always better create a new phoscon group include these groups of lights in it and send a single group command to them. You can have many different groups including the same light. We are not limited to just using one group per light. If you don't like to have many groups in phoscon, then the delay is your best friend.
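
As a rough illustration of the single group command, here is a sketch that builds one request against the deCONZ REST API group action endpoint. The host, port, API key and group id are placeholders, not values from this thread, and the helper name `group_action_request` is made up for the example.

```python
import json
import urllib.request

def group_action_request(host, api_key, group_id, **state):
    """Build (but do not send) a PUT to /api/<key>/groups/<id>/action."""
    url = "http://%s/api/%s/groups/%s/action" % (host, api_key, group_id)
    body = json.dumps(state).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Placeholder values; one request switches every light in the group.
req = group_action_request("192.0.2.10:801", "APIKEY", 7, on=True, bri=127, ct=454)
# urllib.request.urlopen(req) would actually send it.
```

The point is that the whole group is addressed with one request, so the gateway can send one group command instead of a burst of unicasts.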

@morfei1 I'm using a similar workaround and it works most of the time with 5sec delay between all commands.
@MartinTerp I've also had this issue with other devices than the IKEA-Tradfri lights, for example the OSRAM Smart Plugs and Light Strips. I guess it is possible though the problem was caused by some buggy Tradfri-Lights not forwarding the commands.

But it's not supposed to be like this, is it?
At the moment the script turns them on one by one with a 1 sec delay; it SHOULD be able to handle that?

@morfei1 @MartinTerp

I also used to use delays + that I send on and off commands from home assistant twice per triggered automation.

I have now removed the delays and repeat commands since .82.

I'm pretty sure it is not supposed to need a 5sec delay. I think this wasn't needed with some earlier version but I would not bet on it since my network has grown since then. Maybe 82 (haven't tried with reduced delay yet) or future versions will improve this behaviour.

@morfei1 @githtz @tubalainen @martinterp

Could we have the discussion on how to update device firmwares elsewhere?

We're not talking about updating firmware?

If I read it correctly, the issue of sending multiple commands where not all of them get through is a different one. This is likely more an issue with the task scheduler.

I'd suggest to open a separate issue for that like "Not all lights react when sending unicast commands to multiple lights".

The issue here is more about when lights are completely lost and aren't reachable at all, which can be related to routing.

Redacted by Mimiix

Redacted by Mimiix

@inconsx @ReX1983 I thought i was clear on https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-696541484 and https://github.com/dresden-elektronik/deconz-rest-plugin/issues/1261#issuecomment-696046160

Please stick to these rules. I have redacted your posts. Open separate issues if you'd like.

So here is my feedback, i have upgraded to latest firmware and the deconz api. I run my network from HA.

I have about 72 nodes, most of them IKEA GU10 lights. After the upgrade I noticed two different GU10s that "died" on two different days; the only way to get them back is to power cycle. The GU10s use firmware 1.2.217 and 1.2.221. I'll try to upgrade them all to the same version.

I have only 4 GU10, I think these are from the first released version, running on the latest OTA version. Never had a problem with these in terms of reachability, just my scenes were messed up after the light firmware update.

I just had one of my IKEA TRADFRI bulb E14 W op/ch 400lm (version 1.2.214) stop responding. This lamp has done this many times and on different deCONZ firmwares. Right now I'm using firmware 26370500 and deCONZ 2.05.82.

I have only 4 GU10, I think these are from the first released version, running on the latest OTA version. Never had a problem with these in terms of reachability, just my scenes were messed up after the light firmware update.

Have you seen a stability improvement with these GU10 bulbs after the update to 2.3.050? I am now on 0x26650700 and since the update to 2.3.050 (of 17 bulbs) it seems my network is way more stable. Devices don't go offline overnight and my Aqara buttons/switches work at the first attempts. It is now 4 days, so early to say.

I have only 4 GU10, I think these are from the first released version

There is actually also a cheaper version of the IKEA GU10 (W only). These have never been updated to the 2.x software and are still running 1.2.214, and they were the most problematic ones. I personally just gave up, and now they are running smoothly with the IKEA hub.

I think @antonbo is getting closer to the cause of the problem.
IKEA uses the same firmware images on different hardware (PSU / LED drivers) and has different settings in the "userdata" for each hardware variant (the model ID is stored there too). So one firmware can work OK on one piece of hardware (E14) and have problems on others (GU10 / E27).
One difference is that the IKEA GW runs in ZLL mode (on Zigbee PRO) and does not poll device status; it only listens for the status changes the devices send into the network (à la the Zigbee PRO / ZB3 standard). Mixing them with HUE and other vendors, where the coordinator polls status from the devices (not ZB3 behaviour), can make a mess of the network.
I have the oldest IKEA RGBW and one E27 Opal WW on old, updated (1.X) firmware that run OK, and a bunch of new ZB3 GU10 WS that work great. My backbone is around 10 IKEA plugs that keep the network communication stable, and my Xiaomi sensors also work OK with all of them.
My feeling is that it is not a general problem, but rather HW and perhaps FW in combination with the network layout, and that it helps to avoid some problematic devices (like the OSRAM Outdoor Plug, which corrupts packets and loses children all the time).

@manup I can only say that for my system the new firmware has made it much worse than before. Whether it's the source routing or something else, I now have one or two bulbs stuck each day that need a power cycle. I have even seen one of my Philips Hue bulbs stuck, but that did not require a power cycle. I don't know if the problems are now caused by the old firmware I am running on them, but then again I can't upgrade either since it gets stuck. :(

I wonder if anyone knows whether the newer GU10 still has these issues? I heard newer devices run Zigbee 3.0. My wife is not happy about these issues, so I could change all of them if it helps.

I've had a hanging light this morning. I think it's one of the newer ones.

The light did not react to any command, was shown as offline. A powercycle didn't fix the issue. I had to reset and re-add it to the network.

@JBS5 @djwlindenaar @lubbertkramer Any feedback on the latest firmwares in comparison with this issue?

I can add that for 48 hours I’ve not had any issues of what this topic is addressing and what I’ve previously experienced.

However, need to add following

  • restart the container with deconz every night
  • moved the grouping from Home-assistant to deconz/phoscon to handle.

Before the above changes I still had some less responsive bulbs, although the experience was already better; as of now I have none.

@Mimiix , well, it's a bit early to tell.

What does annoy me is that the network feels much more sluggish. The time between pressing a button and the light turning on (via a rule) is longer than before. Also turning the light on and changing the brightness now is a two-step process with some time in-between. I still need to look into the sniff logs to check what's causing this. Probably this has nothing to do with the change to source routing, but there you go.

I did have one light that got unresponsive, but I haven't analysed that situation either, so difficult to tell what happened.

My E27 Ikea bulbs have collectively stopped to cooperate.
So far, power cycling has worked for a few of them. 3 are still not back. Guess I have to re-add them ... :(

Firmware: 26610700
Version: 2.05.84 / 14.9.2020
(Raspbee2)

@umrath This seems unrelated to the problem addressed in this issue. Please open a separate issue :)

The symptom looks very much the same to me: Lamps becoming unresponsive for no apparent reason.

As I just said: this seems unrelated. Please open a separate issue. We can always determine afterwards that this is related. I run a tight ship on this issue.

Yes I also think its related. After a few months without the bug, yesterday I had the issue again with two devices, a (very old) IKEA E27 and the newer IKEA outlet, both at recent firmware.

The only thing I did at the time was to transport a KADRILJ blind out of range and later back, but this might be totally unrelated.

Firing up the sniffer shows the same problem:

  • The devices still periodically send NWK Link Status commands;
  • But always with an empty list, it seems they ignore all surrounding routers;
  • They ACK incoming commands on MAC level;
  • They do react to Beacon Requests and send a Beacon out, but that's the only "reply" they give.

Somehow it appears that the Zigbee stack on the IKEA devices partly crashes for the NWK and APS layers and above, or maybe the incoming NWK buffers are full and never released.
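
The symptom pattern listed above can be written down as a simple check over per-device counts taken from a sniffer log. This is only a sketch: `SniffSummary` and `looks_half_dead` are hypothetical names, and how the counts are extracted from a capture is left out entirely.

```python
from dataclasses import dataclass

@dataclass
class SniffSummary:
    link_status_frames: int     # NWK Link Status commands seen from the device
    link_status_neighbors: int  # total neighbor entries across those frames
    mac_acks: int               # MAC-level ACKs for frames addressed to it
    beacons: int                # Beacons sent in reply to Beacon Requests
    nwk_aps_replies: int        # any NWK/APS-level response (ZDP, ZCL, ...)

def looks_half_dead(s):
    """True if the device matches the 'alive at MAC, dead above' pattern."""
    alive_at_mac = s.mac_acks > 0 or s.beacons > 0
    empty_link_status = s.link_status_frames > 0 and s.link_status_neighbors == 0
    silent_above_mac = s.nwk_aps_replies == 0
    return alive_at_mac and empty_link_status and silent_above_mac
```

A device that still lists neighbors in its Link Status, or that answers anything above the MAC layer, would not match and is probably affected by something else (for example a routing problem).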

I tried to bring the devices back to life by sending:

  • ZDP Leave with rejoin;
  • NWK Leave with rejoin (lower level).

Both are ACKed at MAC level but are otherwise ignored.

I then tried to fake coordinator NWK Link Status commands with a valid entry indicating that the device has working incoming and outgoing link costs (3/3), in the hope that the device would pick up the coordinator in its neighbor table. ... didn't work either.

Unfortunately there doesn't seem to be a way to workaround the situation after it happened and the IKEA device goes into this mostly dead state. Looking through some forums the behavior can be seen with all kinds of gateways and underlying Zigbee stacks. Which is interesting since for example the Hue bridge doesn't do any fancy Zigbee stuff like querying neighbor tables or bindings and reporting and still the bug occurs.

What IKEA needs to do is to at least implement a Watchdog which checks that NWK and APS components are still operating and let the device trigger the watchdog to reboot. This would not fix the bug but it would at least keep the device functional when it happens.

@manup Is your old E27 an LL one with the diagnostic cluster?
It could be interesting to check whether something happens with the counters there over a few weeks.
It can't fix the problem, but it might give a hint of what the firmware doesn't like.
Good observations and conclusions!!

I think Silabs is getting a little more bug hunting to do for one of their largest customers ;-)

Hi, this confuses me, since I have not seen this issue after moving to Z2M using the ZZH some months ago. Maybe there is some other variable in the equation. It would be great if someone else who has made the same move could confirm.

@mvjt Are you using RaspBee/ConBee with Z2M or some other coordinator?
Different Zigbee stacks work in different ways and can have / cause different problems.

Edit: Sorry, I see ZZH = new TI coordinator.
deCONZ NCP = Atmel / Microchip
IKEA devices / NCP = Silabs EFR32 / EZSP

I can confirm that HA / ZHA / Zigpy / Bellows works great with "Billy EZSP" = IKEA ICC-1-A module as coordinator, with IKEA routers and end devices from IKEA and Xiaomi, and NO OSRAM or Xiaomi routers.

@MattWestb Yes the E27 has the Diagnostic cluster but I haven't power-cycled it yet to keep it in the error state. By reading the cluster attributes from my other IKEA lights the attributes report all to be not available :/

At least in the past the issue I'm seeing happened also with TI CC253X and at the end of the issue the CC26X2R1 is mentioned, but that was the state in January it might have improved meanwhile. https://github.com/Koenkk/zigbee2mqtt/issues/2032#issuecomment-547483373

It might not be a Silabs bug in general, could also be some custom stuff in the application layer. I think recent Hue devices also use Silabs. From the Hue bridge there are at least TI and Microchip versions out there.

It's a really nasty thing, in many cases the bug might not happen at all or only after a few months, in other networks it happens in much smaller intervals. I guess it is also significant how the networks are operated and that lights which are powered off regularly are less affected.

@manup I have a scenario that fits your "IKEA problem" very well.
When you set up your network, it starts with one network key and the Security Frame Counter starts ticking. If you have not re-formed the network or rolled the network key for a very long time, the counter stalls because it has reached the maximum 0xFFFFFFFF (32 bits) and can't increase.
The device then discards all data arriving at the Zigbee layer because the frame counter is wrong, while the lower network layer still works normally.
The problem is that I don't know a method for reading the current value of a device's Security Frame Counter. I think it is not possible to see it in Wireshark (thinking = not knowing).

What points to this is that devices that were paired early and never reset stall sooner.
Devices that were paired / reset more recently run longer before the counter stalls.
Sending many commands to one device also makes it stall sooner than a device that is not so "communicative".

It is recommended to roll the network key on a regular basis in a ZB3 network to keep it secure, and that also resets the Security Frame Counter so it does not stall.
Some info: AN1233: Zigbee Security, 2.1.6 The Network Security Frame Counter.

It's wild brainstorming, but it could be why the problem shows up after the network has been running for a long time.

One test is to try rolling the network key: the "living" devices should then run OK for a pretty long time, but a stalled one must be reset / rejoined to start its Security Frame Counter from zero.

Hrm, pretty sure rolling the network key is going to drop all Xiaomis.

Wrong 200% secure that they is not updating and leaving the network (the old no ZB3 ones) !!!

@Adminiuga Have you tried rolling the network key on EZSP?
I would find the result interesting!!!

Haven't tried rolling the network key. Actually, it would be interesting to know how many implementations roll the key on a regular basis.
But I can confirm that a PAN ID change has a very high chance of dropping Xiaomis, like 4 out of 5.

I have seen some security penetration tests that sniffed PAN IDs from the street, going back several times with some months in between, and none of the large light suppliers (not named here) had rolled the network key after one year (Mariahilferstraße in Wien), so you are more than likely right (again).

If, and only if, it's the security frame counter that causes the problem, then it behaves the same for all real Zigbee 3 devices, since it is defined in the "Base-Device-Behavior-Specification" and recommended in old LL, but very likely not for non-Zigbee-PRO (old HA and LL) devices or non-certified devices (not naming ....mi).
In that case it is better to roll the network key "manually" once or twice a year, so that you don't need to demolish the house to reset built-in devices several times a month when they stall, and to accept resetting / rejoining the old Xiaomi sensors, which are normally not integrated into the house structure and are easier to reset (normally you open joining, press the button, and they connect).
Many "Chinese Zigbee 3" devices throw away the "old" security frame counter after a power reset and take a new one from the first packet arriving from their neighbours (I saw that when testing Tasmota's Zigbee 3 problems, where the NCP's counter was wrongly reset on restart), so for them you only need to re-power and they are back in.

We also ran some tests last year; no gateway rotated the network key. As described in the specification, incrementing the network key number is the only way to reset the NWK security frame counter. I also share the concern about whether this is supported by all devices; features which are almost never used are usually not well tested. I truly hope I'm wrong here and it works for most devices.

In any case, resetting the frame counter matters most for the gateway since it sends the largest number of outgoing commands; lights and sensors should be fine for years to come. Currently this isn't an issue, but it becomes one in 2–3 years when the gateway counter reaches its limit in larger networks. For that we already have plans to check whether the rotation of the network key, including the key number and frame counter reset, works.
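
To put the 32-bit limit into numbers, here is a back-of-the-envelope sketch. It assumes the counter goes up by one per secured outgoing frame and that the frame rate is constant; the rates used are hypothetical, not measurements from any network in this thread.

```python
MAX_FRAME_COUNTER = 0xFFFFFFFF          # 32-bit NWK security frame counter
SECONDS_PER_YEAR = 365 * 24 * 3600

def years_until_exhausted(frames_per_second, current=0):
    """Years until the counter hits its maximum at a constant frame rate."""
    remaining = MAX_FRAME_COUNTER - current
    return remaining / frames_per_second / SECONDS_PER_YEAR

slow_device = years_until_exhausted(1)    # e.g. a light sending ~1 frame/s
busy_gateway = years_until_exhausted(50)  # e.g. a gateway at ~50 frames/s
```

At one frame per second the counter lasts well over a century, while a sustained rate of roughly 45 to 68 frames per second would exhaust it in about 2 to 3 years.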

Anyway the problem here is different, frame counters of the gateway and the lights in error state are still fine and quite low.

@djwlindenaar I am wondering if you have any new findings, since you are able to technically analyze what happens rather than just report that a lamp has gone offline again, which is all I can do. Your findings and insights are greatly appreciated. :)

Well, thanks for the appreciation... To be honest, I'm not entirely happy with the current network stability. It did go down since updating to the latest deconz firmware. Some lights went offline in the last weeks, where this did not happen for a long time before the upgrade. I did run with the patches which enable regular attribute reporting for IKEA lights, which I didn't apply yet in the current situation

Although it's fun to analyze this, it is a hobby for me and my time is now fully taken up by some remodelling of the house. I'll see if I can make a bit of time for it...

Well, thanks for the appreciation... To be honest, I'm not entirely happy with the current network stability. It did go down since updating to the latest deconz firmware.

I am having the same negative experience. Now most of my Xiaomi devices (mainly Aqara wall switches) go offline frequently during the day and return working after few minutes (I suppose this is due to the fact that Xiaomi devices reboot if they don't receive a response to the request of the attributes of the Time cluster from the coordinator).

The new v2.5.88 release might improve situation. Here IKEA reporting configuration was smoothened so that state transitions don't bombard the network with reports. Prior during state transition every attribute was reported at a very fast pace.

Sounds promising :) Also on/off is a state transition I guess? Or mostly brightness or color mode changes?
Any minimal firmware version needed/recommended for this change?

Only brightness and all color specific attributes like colorX, colorY and color temperature.
For firmware I'd always recommend the latest one 0x26660700 (in case of ConBee II and RaspBee II).
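
Why a smoother reporting configuration reduces traffic during a transition can be illustrated with a toy model. Zigbee attribute reporting has a minimum reporting interval; the sketch below assumes that this is the knob that changed (the thread does not say which parameters v2.5.88 uses), and it simply drops reports that come too early, which is a simplification of real devices.

```python
def count_reports(change_times_s, min_interval_s):
    """Count reports sent when changes closer than min_interval_s to the
    previous report are suppressed (simplified model)."""
    sent = 0
    last = None
    for t in change_times_s:
        if last is None or t - last >= min_interval_s:
            sent += 1
            last = t
    return sent

# Hypothetical 2 second brightness fade that changes value every 100 ms.
fade = [i / 10 for i in range(20)]
unthrottled = count_reports(fade, 0)    # every step is reported
throttled = count_reports(fade, 1.0)    # at most one report per second
```

In this model a single fade goes from 20 reports to 2 per attribute; with many lights changing several attributes at once, that difference adds up quickly.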

The new v2.5.88 release might improve situation. Here IKEA reporting configuration was smoothened so that state transitions don't bombard the network with reports. Prior during state transition every attribute was reported at a very fast pace.

@manup, I think this update solved also the stability issues with the Aqara devices. Thanks

The new v2.5.88 release might improve situation. Here IKEA reporting configuration was smoothened so that state transitions don't bombard the network with reports. Prior during state transition every attribute was reported at a very fast pace.

@manup, I think this update solved also the stability issues with the Aqara devices. Thanks

Unfortunately the problem (#3605) is still here, I rushed to conclusions too early
