ESPEasy 🚀 - [BUG][WiFi stability] ESP Exception 3/29 when layer 2 disconnects

Sounds like I have the same issue. Sometimes my ESP loses its WiFi connection and stays "lost" for hours. I thought it would reboot and reconnect to WiFi thanks to the watchdog, but it doesn't. A cold boot solves the issue.

wolverinevn on 1 Nov 2018

That last thing described by @wolverinevn is something I have seen happening here too.

TD-er on 1 Nov 2018

As we discussed in #1957, I'm quite sure a lot of these strange WiFi/networking issues come from this layer 2 instability... I saw all kinds of strange behaviour before changing my AP to some increased timings...

clumsy-stefan on 1 Nov 2018

> As we discussed in #1957, I'm quite sure a lot of these strange WiFi/networking issues come from this layer 2 instability... I saw all kinds of strange behaviour before changing my AP to some increased timings...

Hope you will find the solution. One of my NodeMCUs hung and has been gone from the router for 5 hours now, for no apparent reason. I hoped the watchdog would recover it, but nothing happened. I have to reboot it manually now. Very annoying!

wolverinevn on 2 Nov 2018

@wolverinevn when you have access to the node again, go to Tools => Advanced and set the "Connection Failure Threshold" to something other than 0 (I suggest something between 50 and 100, depending on the number of tasks you have). This does not actually fix the problem, but it significantly increases the chances that the node reboots and reconnects!

The other workaround would be to tweak some parameters in your access point, if your type of AP allows that...

clumsy-stefan on 2 Nov 2018

@clumsy-stefan Should we set the default in ESPeasy to this level too?

And maybe we should also display this value in the sysinfo page and make it available for rules?

TD-er on 2 Nov 2018

@TD-er setting it to some level by default is probably not a bad thing if it's not too low (it can always happen that a connection fails).

When debugging the issue I thought about how this is done. Currently every unsuccessful connection increases the counter and every successful connection decreases it. I wondered whether it would be more logical to reset it to 0 as soon as a successful connection happens, but I guess it's a bit of an ideological question which makes more sense.

The issue with that number is, if you have 10 tasks, each of them with a retry count of 10 and a resend delay of 100ms, the reboots happen quite quickly if there is a real comms problem (100 retries within about 10 sec.).

Now if you have, for example, always 5 comms failing and 1 successful, you'll be continuously increasing the connection failures. If this happens all the time you will reboot the node sooner or later, even though all data could be delivered.
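
To make the two counter policies concrete, here is a minimal, self-contained model of them. It is only a sketch: `FailureCounter`, `attemptsUntilReboot` and the "reboot once the counter reaches the threshold" rule are assumptions made for illustration and are not taken from the ESPEasy sources.

```cpp
// Hypothetical model of the connection-failure counter discussed above.
// None of these names come from ESPEasy; the reboot rule (count >= threshold)
// is an assumption.
struct FailureCounter {
  int  count          = 0;
  int  threshold      = 100;    // "Connection Failure Threshold"; 0 disables the reboot
  bool resetOnSuccess = false;  // false: decrement on success, true: reset to 0

  // Record one connection attempt. Returns true when a reboot would be triggered.
  constexpr bool record(bool success) {
    if (success) {
      count = resetOnSuccess ? 0 : (count > 0 ? count - 1 : 0);
    } else {
      ++count;
    }
    return threshold > 0 && count >= threshold;
  }
};

// Simulates the "5 failures, then 1 success" pattern from the comment.
// Returns the attempt number at which a reboot would happen,
// or -1 if none happens within maxAttempts.
constexpr int attemptsUntilReboot(bool resetOnSuccess, int threshold, int maxAttempts) {
  FailureCounter c{0, threshold, resetOnSuccess};
  for (int i = 1; i <= maxAttempts; ++i) {
    const bool success = (i % 6 == 0);  // every 6th attempt succeeds
    if (c.record(success)) {
      return i;
    }
  }
  return -1;
}
```

Under these assumptions, with a threshold of 100 the decrement policy reboots at attempt 148, because each cycle of six attempts adds a net 4 to the counter. The reset policy never goes above 5 and so never reboots. Which of the two is preferable is exactly the open question raised above.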

The main issue I'm seeing though is that somehow the node does not realize that the connection on layer 2 is actually gone and continues to send data (I guess). Besides this, something I realized tonight: what happens to syslog (and other comms like NTP etc.) if there is no WiFi connection? Is this also stopped? This could explain why my nodes suddenly jump to 100% CPU when layer 2 is gone. Probably no more task data is sent, but it tries to get rid of the UDP syslog packets and can't... just a guess though...

Sorry, long text for two simple questions... in short:
Default level: yes, I'd set it to the max (100) or so by default... if everything is OK it does no harm, if not, the unit gets accessible again...
Sysinfo page and rules: I'd say no, why should this be dynamically changed? It's an emergency value...

clumsy-stefan on 2 Nov 2018

@clumsy-stefan I've already set it to 50. Let's wait. ;)

wolverinevn on 2 Nov 2018

@wolverinevn hmm... 50 should be reached quite quickly, depending on the number of tasks, intervals and retries you have (5-15 min.)... If this does not help, I think the node is actually not frozen, it just can't reconnect to the network. I had this also even after a WD reboot. Can you see if the node tries going to AP mode? Do you see the AP-WLAN of the node?

clumsy-stefan on 2 Nov 2018

> Sysinfo page and rules: I'd say no, why should this be dynamically changed? It's an emergency value...

I meant to be inspected in rules using a system variable like %conn_fail% and show it on the sysinfo page, next to the number of wifi reconnects.
After all, it is a performance statistics value

TD-er on 2 Nov 2018

> I meant to be inspected in rules using a system variable like %conn_fail% and show it on the sysinfo page, next to the number of wifi reconnects. After all, it is a performance statistics value

ah, yes, agree, that would make sense! that's also a bit related to the issue #1993. Having a plugin that sends a number of system/performance variables regularly to the controller (without wasting the limited available tasks) would be really great!

clumsy-stefan on 2 Nov 2018

> @wolverinevn hmm... 50 should be reached quite quickly, depending on the number of tasks, intervals and retries you have (5-15 min.)... If this does not help, I think the node is actually not frozen, it just can't reconnect to the network. I had this also even after a WD reboot. Can you see if the node tries going to AP mode? Do you see the AP-WLAN of the node?

I have 9 tasks, 3 of them are Dummy and MQTT_import. I think the rules are a little bit busy with computing and reading sensors; I tried to limit mqtt_publish by calling it from rules only every few minutes. Load is around 29%.
As I remember, the last time it froze was this morning, and I couldn't find the AP of ESPEasy (if by AP-WLAN you mean the node operating in AP mode).
My setup (network, location of ESP) was working great with another NodeMCU running 2.3 or 2.4, which was released in March.

Uptime is 7 hrs 20 mins, RSSI is -71 dBm, there are a few WiFi networks around me.
Last Disconnect Reason: (200) Beacon timeout
Number reconnects: 35

wolverinevn on 2 Nov 2018

@wolverinevn the problem with this issue is that it happens completely at random. I have ~30 nodes running; some of them hit the issue and some didn't, some rebooted, some went to AP mode...

It really seems to be a combination of how busy the node is, how busy the air is (e.g. number of WiFi devices) and how your AP actually handles certain conditions (missing layer 2 ACKs etc.)...

So I guess until we find a way within the application (ESPEasy) to reliably detect this condition and act on it, there is no "real" solution...

clumsy-stefan on 2 Nov 2018

@wolverinevn PS: you're not using mikrotik AP's by chance?

clumsy-stefan on 2 Nov 2018

@wolverinevn About the number of reconnects (in your edit)
35 reconnects in about 8 hours is a lot.
I have nodes here running for days which only have a handful of reconnects.
The most stable one is running for 20 days 11 hours 46 minutes now and only 1 reconnect.

Connected | 19d22h33m
-- | --
Last Disconnect Reason | (202) Auth fail
Number reconnects | 1

TD-er on 2 Nov 2018

> @wolverinevn PS: you're not using mikrotik AP's by chance?

No. I'm using router running Padavan firmware (kind of ASUS).

@TD-er I knew it. I'm looking into the reason; it may be noise from a nearby buck module. Another one has 0 reconnects after 2 hours.

wolverinevn on 2 Nov 2018

> No. I'm using router running Padavan firmware (kind of ASUS).

Unfortunately I don't know this FW at all... Any chance to tweak layer 2 parameters? Something like frame ACK timeouts or similar? Some kind of "distance" settings?

clumsy-stefan on 2 Nov 2018

@clumsy-stefan Unfortunately, I don’t see anything like that.

wolverinevn on 2 Nov 2018

@clumsy-stefan The unit rebooted 2 times last night with the failure threshold set to 50. The good news is that it no longer freezes. Today I will try to improve the WiFi connection with some minor changes to the hardware setup.

wolverinevn on 3 Nov 2018

3 Wemos units in the same room, connected to the same AP.
Reconnects in the last 16 hours or so,
With Rule: On WiFi#Connected ....

26 WD reboots and 104 re-connections:
muc21_capture

9 WD reboots and 32 re-connections
muc19_capture

2 WD reboots and 40 re-connections
muc14_capture

All have 50 failure threshold set

Domosapiens on 4 Nov 2018

@Domosapiens & @wolverinevn one more thing you can try is increasing the group-key-timeout on your AP (if you have such an option). Normally that's around 5 min. You can try increasing it to 30 min. or even 1 h and see if it improves (as long as it's not a super high security network, which I don't assume if you have IoT devices in it)...

clumsy-stefan on 5 Nov 2018

@TD-er

> The most stable one is running for 20 days 11 hours 46 minutes now and only 1 reconnect.

I also have currently units that ran for over 3 days now and other that rebooted within a day...

I did see some issues with the rekeying of the group key. It somehow seems that in newer versions of the core the rekeying can run into a timeout... However, the application should act on this and not go into some high-load, unresponsive mode... but I'm not sure where it's failing...

clumsy-stefan on 5 Nov 2018

@clumsy-stefan thanks for your suggestion.
I see in my dd-wrt router a param. "Key Renewal Interval" with value 3600 (in seconds).
So that should be fine ?

> rekeying runs into a timeout

That would explain only an hourly re-connect imho.
Thanks to the great rule feature _WiFi#Connected_ we can see this.
No idea how older versions performed on this.
A hidden problem already for a long time?

Domosapiens on 5 Nov 2018

After setting my group key renewal from 5 min. to 1 h the units run much more stably. I assume that with 40 clients connected to one AP, doing a rekey every 5 min. just got the air a bit too busy... After that I could even decrease the frame-ACK timeout again and the units got more responsive again; I also got fewer "connection timeouts" after changing these parameters...

@TD-er I still think this "group key timeout" and the "assoc fail" afterwards need to be captured and handled by the application. I get the feeling that the unit continues to try to send queued messages to the controller/server while trying to do a rekey, which makes the unit too slow so the rekeying fails... therefore disconnecting on layer 2.

Just a rough idea though. I'm still trying to really pin it down, but that's the closest I could get until now.

clumsy-stefan on 5 Nov 2018

@TD-er just now I again experienced a node that goes into an indefinite loop of reconnect attempts that always expire or fail (see log below). Interesting is that it obviously realizes it's not connected and does not try to send data to the controller, which means the "failed connect attempts" counter does not increase anymore and therefore never hits the failed connection limit.

Also the load always shows 100%; that's why I guess it does not succeed in reconnecting, as it's always too slow to actually do the handshake... Seems like a vicious circle to me... Only memory is slowly decreasing (until, I assume, it crashes because it is out of memory)...

I'm going to test with #2073 and see if this situation occurs again... However, it occurs very rarely on the two serial-connected nodes, which are the ones where I'm able to really track what's going on...

105587 : EVENT: WiFi#Disconnected
3105617 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 129 ms
3131325 : WD   : Uptime 52 ConnectFailures 84 FreeMem 12976
3134621 : EVENT: WiFi#Disconnected
3134652 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5141 ms
3141487 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3141488 : WIFI : Connecting clumsy_ap2 attempt #53
3142671 : EVENT: WiFi#Disconnected
3142700 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1122 ms
3153443 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3153444 : WIFI : Connecting clumsy_ap2 attempt #54
3153713 : EVENT: WiFi#Disconnected
3153743 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 157 ms
3161324 : WD   : Uptime 53 ConnectFailures 84 FreeMem 12976
3165518 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3165519 : WIFI : Connecting clumsy_ap2 attempt #55
3178542 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3178543 : WIFI : Connecting clumsy_ap2 attempt #56
3179728 : EVENT: WiFi#Disconnected
3179757 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1122 ms
3189704 : SYS  : 31.00,12808.00,100.00,53.00
3189708 : EVENT: sysinfo#rssi=31.00
3189743 : EVENT: sysinfo#mem=12808.00
3189773 : EVENT: sysinfo#load=100.00
3189804 : EVENT: sysinfo#uptime=53.00
3191297 : WD   : Uptime 53 ConnectFailures 84 FreeMem 12800
3191441 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3191442 : WIFI : Connecting clumsy_ap2 attempt #57
3191576 : EVENT: WiFi#Disconnected
3191606 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 125 ms
3200507 : EVENT: Clock#Time=Tue,10:25
3204458 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3204459 : WIFI : Connecting clumsy_ap2 attempt #58
3217493 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3217493 : WIFI : Connecting clumsy_ap2 attempt #59
3218677 : EVENT: WiFi#Disconnected
3218707 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1122 ms
3221325 : WD   : Uptime 54 ConnectFailures 84 FreeMem 12800
3245444 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3245445 : WIFI : Connecting clumsy_ap2 attempt #61
3249673 : SYS  : -80.00,12632.00,100.00,54.00
3249677 : EVENT: sysinfo#rssi=-80.00
3249709 : EVENT: sysinfo#mem=12632.00
3249741 : EVENT: sysinfo#load=100.00
3249772 : EVENT: sysinfo#uptime=54.00
3250620 : EVENT: WiFi#Disconnected
3250650 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5130 ms
3251307 : WD   : Uptime 54 ConnectFailures 84 FreeMem 12624
3259435 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3259436 : WIFI : Connecting clumsy_ap2 attempt #62
3260490 : EVENT: Clock#Time=Tue,10:26
3260650 : EVENT: WiFi#Disconnected
3260679 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1121 ms
3273521 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3273522 : WIFI : Connecting clumsy_ap2 attempt #63
3273659 : EVENT: WiFi#Disconnected
3273689 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 125 ms
3281281 : WD   : Uptime 55 ConnectFailures 84 FreeMem 12624
3287445 : WIFI : AP Mode ssid will be wemos-mini-18_18 with address 192.168.4.1
3287446 : WIFI : Connecting clumsy_ap2 attempt #64

clumsy-stefan on 27 Nov 2018

It is hard to connect with an RSSI of -80.

uzi18 on 27 Nov 2018

The RSSI value is not reliable when the ESP node is not connected! The same node has less than -70 when connected... Deep-sleep nodes even show the "not connected" default of +31 when sending the values!
And after a reboot it runs without issues (without moving it...). If you read above, it's a layer 2 issue which is reproducible when you tweak parameters on the AP...

PS: you can try it yourself! Just connect 20-30 nodes and lower the group-key timeout to 5 min. and the frame reply threshold value to something like 7... see what happens!

clumsy-stefan on 27 Nov 2018

I think this is the wrong place for layer 2 issues; you should file an issue on the esp8266 core or, maybe better, on the NONOS SDK project.
But please don't shout here.

uzi18 on 27 Nov 2018

It only happens with ESPEasy and not with the other types of firmware I tried. I assume the node gets too busy at some point and doesn't leave the core enough time to do the rekeying (everything explained a number of times), so feel free to read the debug logs and explanations...
But I agree, you don't actually need to answer...
PS: @uzi18 have you ever had 30 nodes running successfully at the same time?

clumsy-stefan on 27 Nov 2018

@TD-er: in ESPEasyWifi.ino lines 650 - 669 the switch statement's default match breaks out of the switch and therefore `tryConnectWiFi()` returns `true` even though `WiFi.status()` is not necessarily `WL_CONNECTED` but can be any other state (only 2 false return states are checked...).

Changing this and returning `true` only if `WiFi.status()` actually returned `WL_CONNECTED` solves at least one of the layer 2 disconnect/exception issues I'm facing!

What do you think?
Am I missing something, or why should `tryConnectWiFi()` return `true` when `WiFi.status()` is not `WL_CONNECTED`?
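
The difference between the two return behaviours can be sketched as follows. This is not the actual ESPEasy code: `WifiStatus` is a local stand-in for the Arduino core's status enum (its numeric values are not meant to match the core), `resultBefore`/`resultAfter` are hypothetical names, and which two states returned `false` in the original switch is an assumption.

```cpp
// Local stand-in for the core's WiFi status codes, so the sketch compiles on its own.
enum class WifiStatus {
  WL_IDLE_STATUS,
  WL_NO_SSID_AVAIL,
  WL_SCAN_COMPLETED,
  WL_CONNECTED,
  WL_CONNECT_FAILED,
  WL_CONNECTION_LOST,
  WL_DISCONNECTED
};

// Behaviour as described in the comment: two states return false (assumed here
// to be WL_NO_SSID_AVAIL and WL_CONNECT_FAILED), every other state falls
// through the default branch and ends up returning true.
constexpr bool resultBefore(WifiStatus s) {
  switch (s) {
    case WifiStatus::WL_NO_SSID_AVAIL:
    case WifiStatus::WL_CONNECT_FAILED:
      return false;
    default:
      break;
  }
  return true;
}

// Proposed behaviour: report success only when the status really is WL_CONNECTED.
constexpr bool resultAfter(WifiStatus s) {
  return s == WifiStatus::WL_CONNECTED;
}
```

With this model, a status of `WL_DISCONNECTED` or `WL_IDLE_STATUS` counts as success in the first variant and as failure in the second, which is the mismatch pointed out above.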

clumsy-stefan on 4 Jan 2019

@clumsy-stefan Good to see you're still digging into the WiFi code.

https://github.com/letscontrolit/ESPEasy/blob/5ee18ec556c9c58802af29f5fd78593905ef35c1/src/ESPEasyWifi.ino#L604-L671

The initial idea of this function was to start the WiFi connect sequence.
Maybe its function has a bit changed throughout all the changes since.
But you may be on to something here.
I think it may be a proper change to return true (in the end of that function) only if the status returns it is connected.

Can you describe what seems to be the other layer 2 issue you're facing?

TD-er on 4 Jan 2019

Can't say exactly when the other exception occurs, but at least there is a difference if this function returns `true` only when the status is `WL_CONNECTED`. I attach the debug output before and after the change...

before

156874 : EVENT: WiFi#Disconnected Processing time:74 milliSeconds
156876 : WIFI : Disconnected! Reason: '(0) Unknown' Connected for 2 m 32 s
156877 : WIFI  : Arduino wifi status: WL_CONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
157208 : WIFI : Connecting clumsy_ap2 attempt #0
157212 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
scandone
160069 : Fatal exception 9(LoadStoreAlignmentCause):
epc1=0x40105cd4, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000003, depc=0x00000000

Exception (9):
epc1=0x40105cd4 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000003 depc=0x00000000

after

108304 : EVENT: WiFi#Disconnected Processing time:73 milliSeconds
108307 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 2014 ms
108308 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
109217 : WIFI : Connecting clumsy_ap2 attempt #1
109220 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_DISCONNECTED
scandone
state: 0 -> 2 (b0)
state: 2 -> 3 (0)
state: 3 -> 5 (10)
add 0
aid 1
cnt

connected with clumsy_ap2, channel 9
dhcp client start...
112113 : WIFI : Connected! AP: clumsy_ap2 (4C:5E:0C:39:F6:55) Ch: 9 Duration: 2895 ms
112115 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_CONNECTED
ip:10.0.10.117,mask:255.255.0.0,gw:10.0.0.2
113751 : WIFI : DHCP IP: 10.0.10.117 (wemos-mini-17-17) GW: 10.0.0.2 SN: 255.255.0.0   duration: 1636 ms
113765 : EVENT: WiFi#Connected

Hmm... I just realized that after this the internal status does not (yet) match the Arduino status... this could also lead to issues, I guess...

clumsy-stefan on 4 Jan 2019

@clumsy-stefan this status exists because we can't rely on the Arduino WiFi status; that's why @TD-er introduced the ESPEasy status. But OK, maybe we can double check that every status in the code is properly checked.

uzi18 on 4 Jan 2019

It could be this wifi status has been fixed in core 2.5.0, so maybe our own status has become obsolete.
That would be nice, since it makes the WiFi code rather complicated and thus prone to errors.

Edit:
I'm looking at this error you gave:
Fatal exception 9(LoadStoreAlignmentCause):
One of the more recent fixes in core 2.5.0 is about the constructor of IPAddress, which should fix problems when the alignment of the given byte sequence isn't 32-bit aligned.
Maybe this is something similar?

TD-er on 4 Jan 2019

That's one of the guesses I have, that the ESPEasy status is not always in sync with the Arduino status. Especially temporary disconnects on layer 2 (e.g. WiFi rekeyings) are probably not really handled / noticed.

One other thing could be the opposite: ESPEasy thinks it's disconnected and tries to reconnect while the core is still connected, which leads to an exception. But I can't prove that one yet...

About the alignment: yes, it can be, but I can't nail this down either currently...

So the only thing I'm quite sure of currently is that the return code of tryConnectWiFi() should match the actual connection status, or at least check for WL_CONNECTED...

clumsy-stefan on 4 Jan 2019

@TD-er I'm somewhat more concerned about

connected with clumsy_ap2, channel 9
dhcp client start...
112113 : WIFI : Connected! AP: clumsy_ap2 (4C:5E:0C:39:F6:55) Ch: 9 Duration: 2895 ms
112115 : WIFI  : Arduino wifi status: WL_DISCONNECTED ESPeasy internal wifi status: ESPEASY_WIFI_CONNECTED

For me, the last two lines indicate that the core did not yet update the status, even though ESPEasy thinks it did... so it could end up in some race condition here...

after this happens sometimes I start to see a lot of

7989956 : Read settings: ControllerSettings index: 0
7989997 : Read settings: ControllerSettings index: 0
7990130 : Read settings: ControllerSettings index: 0
7990267 : Read settings: ControllerSettings index: 0
7990399 : Read settings: ControllerSettings index: 0
7990531 : Read settings: ControllerSettings index: 0
7990664 : Read settings: ControllerSettings index: 0
7990799 : Read settings: ControllerSettings index: 0
7990938 : Read settings: ControllerSettings index: 0

from which it never recovers...

a bit later it tells me:

8185850 : ip:169.254.37.119,mask:255.255.0.0,gw:0.0.0.0
Read settings: ControllerSettings index: 0
8185975 : WIFI : DHCP IP: 169.254.37.119 (wemos-mini-18-18) GW: (IP unset) SN: 255.255.0.0
8185990 : EVENT: WiFi#Connected

No clue where this comes from... but after this it starts to try to connect to the controller/server which obviously fails until it runs out of tries (100) and reboots...

EDIT: btw, you can force this behaviour if you kick the node off the AP manually...sometimes it just reconnects, sometimes it happens what's described above...
~~EDIT2: sometimes it crashes with Exception 9... so it seems to be some kind of race condition in how exactly it recovers from a disconnect!~~ My fault, I had an addLog() in the onDisconnect() callback...

clumsy-stefan on 4 Jan 2019

It's more or less always the same situation. After the 4-way handshake fails (rekeying) it never recovers... not sure how to force a recovery from this...

At least adding some delay(100) in processConnect() and processDisconnect(), thus giving the core time to update the WiFi status, brings the WiFi statuses in sync again. This also makes the units get into the situation below much less often!

900695 : WIFI : DHCP renew probably failed
900697 : Reset WiFi.
900699 : WIFI : Connecting clumsy_ap2 attempt #0
901713 : EVENT: WiFi#Disconnected
901772 : WIFI : Disconnected! Reason: '(8) Assoc leave' Connected for 14 m 56 s
902326 : WD   : Uptime 15 ConnectFailures 44 FreeMem 20248 WiFiStatus 0
907048 : EVENT: WiFi#Disconnected
907106 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 6172 ms
907786 : WIFI : Connecting clumsy_ap2 attempt #1
911821 : EVENT: WiFi#Disconnected
911879 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 3860 ms
911894 : WIFI : Connecting clumsy_ap2 attempt #2
912793 : EVENT: Clock#Time=Sat,08:29
919974 : EVENT: WiFi#Disconnected
920033 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 7962 ms
920824 : WIFI : Connecting clumsy_ap2 attempt #3
922083 : EVENT: WiFi#Disconnected
922141 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1125 ms
922805 : WIFI : Connecting clumsy_ap2 attempt #4
923138 : EVENT: WiFi#Disconnected
923196 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 133 ms
925831 : WIFI : Connecting clumsy_ap2 attempt #5
931179 : EVENT: WiFi#Disconnected
931238 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5165 ms
931775 : WIFI : Set WiFi to AP+STA
932701 : EVENT: WiFi#APmodeEnabled
932778 : WIFI : AP Mode ssid will be wemos-mini-17_17 with address 192.168.4.1
932778 : WIFI : Connecting clumsy_ap2 attempt #6
933023 : WD   : Uptime 16 ConnectFailures 44 FreeMem 17856 WiFiStatus 0
934065 : EVENT: WiFi#Disconnected
934122 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1123 ms
935712 : WIFI : Connecting clumsy_ap2 attempt #7
936042 : EVENT: WiFi#Disconnected
936106 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 169 ms
938745 : WIFI : Connecting clumsy_ap2 attempt #8
939079 : EVENT: WiFi#Disconnected
939138 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 131 ms
941778 : WIFI : Connecting clumsy_ap2 attempt #9
947130 : EVENT: WiFi#Disconnected
947189 : WIFI : Disconnected! Reason: '(15) 4way handshake timeout' Connected for 5140 ms
947725 : WIFI : Connecting clumsy_ap2 attempt #10
948976 : EVENT: WiFi#Disconnected
949035 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 1121 ms
951805 : WIFI : Connecting clumsy_ap2 attempt #11
952140 : EVENT: WiFi#Disconnected
952199 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 134 ms
955778 : WIFI : Connecting clumsy_ap2 attempt #12
956115 : EVENT: WiFi#Disconnected
956174 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 142 ms
959734 : WIFI : Connecting clumsy_ap2 attempt #13
960064 : EVENT: WiFi#Disconnected
960123 : WIFI : Disconnected! Reason: '(203) Assoc fail' Connected for 129 ms

clumsy-stefan on 5 Jan 2019

@TD-er tryConnectWiFi() returns `true` or `false` depending on whether the connection was successful... however WiFiConnectRelaxed() never actually checks this...

Is this function somehow from before the event-based WiFi? It seems like it never reaches the last 2 lines in that function...

clumsy-stefan on 7 Jan 2019

Yes it was some kind of left-over from before the event-based wifi.
I think we really should consider having a good look at that WiFi code again, since it isn't as structured as it should be.

TD-er on 7 Jan 2019

ok... I'm still debugging what exactly happens due to the 4way handshake timeout and why the node won't reconnect again.... but I think I'm still poking a bit in the dark however finding small bits here and there which could add up though....

clumsy-stefan on 7 Jan 2019

One other small one seems to be in WifiCheck()... In there, checking for layer 2 connectivity is only done when the IP is not valid anymore (i.e. all octets are 0). This could lead to the situation that layer 2 is dis- (or re-)connecting/handshaking etc. while the IP is still valid, as the DHCP lease has not yet expired. That's probably why the "DHCP renew probably failed" only happens after a long time, when the lease is actually gone... but I'm still verifying this...
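
A small model of that suspicion, comparing a check that only looks at layer 2 once the IP is gone with one that looks at both independently. Everything here is hypothetical (`LinkState`, `ipIsSet`, `needsReconnectIpOnly`, `needsReconnectBoth`); it only illustrates the condition described above and is not how ESPEasy implements its check.

```cpp
// Hypothetical snapshot of the link: layer 2 association plus current IPv4 address.
struct LinkState {
  bool          layer2Connected;  // what a WL_CONNECTED-style status check would report
  unsigned char ip[4];            // all zeros means "no valid IP"
};

constexpr bool ipIsSet(const LinkState& s) {
  return s.ip[0] != 0 || s.ip[1] != 0 || s.ip[2] != 0 || s.ip[3] != 0;
}

// Check as described in the comment: layer 2 is only inspected once the IP is
// no longer valid, so a stale-but-valid DHCP address hides a layer 2 loss.
constexpr bool needsReconnectIpOnly(const LinkState& s) {
  if (ipIsSet(s)) {
    return false;
  }
  return !s.layer2Connected;
}

// Alternative sketch: losing either layer 2 or the IP is a reason to act.
constexpr bool needsReconnectBoth(const LinkState& s) {
  return !s.layer2Connected || !ipIsSet(s);
}
```

For a node that has lost layer 2 but still holds an unexpired lease such as 10.0.10.117, the first check reports nothing to do while the second one flags it. Whether acting immediately is desirable depends on how the core's own reconnect behaves, which is still under investigation in this thread.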

clumsy-stefan on 7 Jan 2019

Also there are wifiCheck(), WiFiConnected() and connectionCheckHandler(), which all do some kind of connection checking and call each other as well as resetWiFi() under certain conditions... Especially connectionCheckHandler() seems to be called only when mqtt_reconnect_count > 10. So what happens in a non-MQTT environment?

PS: I'm just documenting my findings here while searching for the underlying WiFi troubles... So I'm happy about any thoughts on it, but they're not necessarily expected...

clumsy-stefan on 7 Jan 2019

There has to be some kind of mysterious race condition somewhere. When the AP initiates a reauth or rekeying, sometimes the node/core either does not get enough CPU time to complete the handshake or it gets interrupted while doing so, especially on nodes with low signal (where the handshake takes much longer)... (see below).

These rekeyings/reauths seem to generate a disconnect event from the core, even though the core would do an automatic renegotiation or reconnect (as I understand it). But then it seems this handshake process gets interrupted by the ESPEasy state machine, as a manual reconnect is triggered... This process repeats itself and never ends (or in some cases generates WDTs)...

810114 : EVENT: WiFi#Disconnected
810146 : WIFI : Disconnected! Reason: '(16) Group key update timeout' Connected for 13 m 16 s
810977 : WIFI : Connecting clumsy_ap2 attempt #0
811081 : WIFI : Connection lost to: clumsy_ap2
813089 : EVENT: WiFi#Disconnected
813120 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 2011 ms
814977 : EVENT: Clock#Time=Thu,08:55
821529 : WD : Uptime 14 ConnectFailures 0 FreeMem 22000 WiFiStatus 0
831976 : WIFI : Connecting clumsy_ap2 attempt #1
832079 : WIFI : Connection lost to: clumsy_ap2
836831 : EVENT: WiFi#Disconnected
836863 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 4753 ms

clumsy-stefan on 10 Jan 2019

So maybe we should use the wifi events only to monitor the process, not take action ourselves?
Changing this will render 2.3.0 and perhaps 2.4.0 builds unusable though.

TD-er on 10 Jan 2019

No, I wouldn't do that... In my branch I already did a couple of (not so intrusive) changes which seem to make these cases occur less often...

Just found another small thing: in tryConnectWiFi() the WiFi.status() check towards the end, after the WiFi.begin() call, starts to return WL_DISCONNECTED. According to the documentation this means "if module is not configured in station mode". I'm trying to find out why this happens, or whether it actually helps to put the ESP explicitly in AP mode before calling WiFi.begin().

So I still hope to find the "real" underlying issue why these stalls happen... If so (fingers crossed) I would suggest that I do a PR with all changes for you (and others) to review...

clumsy-stefan on 10 Jan 2019

Please know that I have also seen lots and lots of situations where the state of WiFi.status() was not correct.
So maybe the core libraries now have fixes and very likely I messed up also somewhere in all the attempts to get the WiFi to behave stable.
That debugging was very hard to do, since I cannot reproduce these issues myself and had to act on reports by other users.
Lately I have a module which is also behaving badly regarding WiFi, so that's my favorite WiFi test module. But it may also be an indication the problem will be made worse when some parts are just close being out of spec. For example power supply, or missing decoupling capacitors, (too) thin PCB traces, less shielding, etc.

TD-er on 10 Jan 2019

Sure, not that I'm ruling HW issues out. But I have ~40 nodes running, with all different kinds of power supplies, different boards, brands, etc., and at some point most of them face connectivity issues... especially the ones with weak WiFi coverage or lots of GPIOs in use...

And I currently do have a bit of time to do some debugging and I still find it very interesting and challenging ;) By now I even start to understand how the whole sequence of connecting, checking, disconnecting and so on works ;)

So if it's OK for you I'll keep digging deeper in these connectivity issues.... you're the boss though....

clumsy-stefan on 10 Jan 2019

Please continue digging :)
I really want to be freed from all those disconnect issues which are next to impossible to reproduce.
They already have taken way too long now and it would be really great if they are fixed.

TD-er on 10 Jan 2019

I'm getting about 4-24 hours of uptime followed by 2-10 hours of downtime on average.
During downtime the node continues to work, but there is no wifi connection.

The accesspoint (MikroTik) shows:

18:16:15 wireless,info 80:xx:xx:xx:xx:xx@iotnet: connected, signal strength -62 
18:16:20 wireless,info 80:xx:xx:xx:xx:xx@iotnet: disconnected, unicast key exchange timeout 
18:17:31 wireless,info 80:xx:xx:xx:xx:xx@iotnet: connected, signal strength -60 
18:17:36 wireless,info 80:xx:xx:xx:xx:xx@iotnet: disconnected, unicast key exchange timeout

ESP Easy | Information
-- | --
Build: | 20103 - Mega
Libraries: | ESP82xx Core 2_4_2, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3 PUYA support
GIT version: | mega-20190110
Plugins: | 7 [Normal] [Sonoff POW R1/R2]
Build time: | Jan 10 2019 03:21:19
Binary filename: | ESP_Easy_mega-20190110_hard_SONOFF_POW_4M.bin

0ki on 11 Jan 2019

This was not the problem on the older version (when I still could use channel 14 and have my hidden-ssid network).

0ki on 11 Jan 2019

Do you have the WiFi channel fixed, or is it variable?
I've been reading the troubleshooting page of Tasmota and they strongly advise having the WiFi channel fixed.
If channel 14 is not usable, then something regarding regional settings may have changed somewhere, since only channels 1-11 are allowed in all countries.
Also channel 1, 6 & 11 are actually the only channels usable without interference from other channels.

TD-er on 11 Jan 2019

only channels 1-10 work, not 11. See this: https://github.com/letscontrolit/ESPEasy/issues/1337#issuecomment-394118989

Wifi channel is fixed, of course.

Any channel can have interference from other channels, no way to control it. Hell, there can even be interference from the same channel.

btw, regarding the issue I'm facing - the wifi status LED (inverted GPIO13) is normally solid blue, but when the device is offline, its brightness goes in the pattern 0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,0,0,0,1,2,3,4,5,6,7,8,9,0,0,0,0,0,...

0ki on 12 Jan 2019

Fixed IP and DCS, but the channels never changed in the past.


NorbertRoller on 12 Jan 2019

Mikrotik stuff gives great power, but even the default settings for wifi are not right.

Make sure you don't have TKIP enabled in the security settings and enable only AES CCM encryption
Enter a group key update interval like 30 min or 1 hour
In channels, set the channel width to 20 MHz and, in my opinion, ditch the B band and go just for G/N
Extension Channel disabled
Only disabling PMKID (recommended if you are hysterical about security) seems to make the ESP veeeery unhappy
Add your country to the wifi settings (or see what the max power for your AP is and lower the transmission power by 3). This very often boosts wifi performance (counterintuitive, I know)

I don't generally have disconnects (except when I had this PMKID thing)

It would be insightful to post your AP settings anyway :-)

jimmys01 on 12 Jan 2019

Those are good tips. I am quite well versed with networks though and even more experienced in mikrotik and I've configured all as you say except group key.

Group key update only has to do with the group key, not with the unicast key where the problem lies, but I've downgraded that setting as well for the purposes of the experiment.

0ki on 12 Jan 2019

@0ki Cool, didn't know you were an expert on RouterOS!

jimmys01 on 12 Jan 2019

I've concluded that it's clearly a bug in the ESPEasy software. You can see the problematic behaviour on the right that matches with RouterOS disconnecting the misbehaving client on the left. The radio spectrum is clear as you see.
[image: test]

Unfortunately I don't have the time to delve into ESPEasy codebase right now, but I trust this comparison of bad vs good communication will be of use.
Legend: __red - access point | blue - espeasy | green - another device on my iot network__
[image: test2]

I believe that ESP_easy for some reason decides to switch to AP mode after receiving assoc response (frame 380 on left) and not continue the four-way handshake. Note the change of espeasy macaddress as well - the first octet changes from 80 to 82.
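
The 80 to 82 change in the first octet is what you get when the "locally administered" bit (0x02) of a MAC address is set, and as far as I know that is how the ESP8266 derives its SoftAP address from the station address. A minimal sketch in plain C++ for checking captured addresses; the helper names are made up for illustration and are not part of ESPEasy or the core:

```cpp
#include <cstdint>

// Bit 1 of the first MAC octet marks a "locally administered" address.
constexpr std::uint8_t kLocallyAdministeredBit = 0x02;

// Hypothetical helper: true if the first octet has the local bit set.
constexpr bool isLocallyAdministered(std::uint8_t firstOctet) {
  return (firstOctet & kLocallyAdministeredBit) != 0;
}

// Hypothetical helper: first octet of a SoftAP address, ASSUMING it is
// derived from the station address by setting the local bit.
constexpr std::uint8_t softApFirstOctet(std::uint8_t stationFirstOctet) {
  return static_cast<std::uint8_t>(stationFirstOctet | kLocallyAdministeredBit);
}
```

With the addresses from the capture, `softApFirstOctet(0x80)` gives `0x82`, which would fit frames being sent from the AP interface instead of the station interface.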

0ki on 12 Jan 2019

👍1

@0ki

I've concluded that it's clearly a bug in the ESPEasy software.

Me too, but I have no idea where the bug could be.
It seems like ESPEasy's internal notion of the disconnected state is not in sync with the true state.
It is quite possible I've made some error in writing that code, or it is a bug in the core libraries.

TD-er on 12 Jan 2019

Which file specifically do you mean by "that code"? If you point to a file or two, I could take a look at it.

Otherwise - since now you see the exact moment where it happens, I hope you can take a closer look at it.

0ki on 12 Jan 2019

As I'm also digging into this issue (for about 2 months now), I'm not completely sure anymore that it's the ESPEasy code itself; it looks more like an interaction between ESPEasy and the WiFi core/library... My debugs/logs show that the "Disconnect" always happens after a group key exchange / rekeying. BUT the callback is only called after the rekeying timeout and subsequently after the authentication failed... So currently my guess is that ESPEasy does not leave enough CPU time to the core to actually do the rekeying... Up to now, though, I haven't found a way to find out when the node is in this rekeying mode or in the reauth mode, and therefore to add enough delay() calls so that the rekeying/reauth is successful...

clumsy-stefan on 12 Jan 2019

To help pinpoint the problem timewise - it was working ok with:

Build   20100 - Mega (core 2_3_0)
GIT version mega-20180224
Plugins 48 [Normal]
Build Md5   719b94a4d6bc257b86916e4989eed3a0
Md5 check   passed.
Build time  Feb 24 2018 03:03:12

And by ok I mean that not only there were no disconnect, but also that connecting to hidden AP on channel 14 was not a problem.

0ki on 12 Jan 2019

@clumsy-stefan, I fully support catching the problem that causes it to disconnect in the first place, but honestly, there may be a number of reasons why a wifi assoc may drop; of course a group key exchange shouldn't be one of them.

The more pressing problem is - why can't it reconnect after it has disconnected. It can't be a timing issue, as I'm giving espeasy a whopping 5000ms to reply with message 2 of the 4way HS.

0ki on 12 Jan 2019

I think the issue is that you only get the disconnect AFTER the key exchange fails... so you never know when the exchange actually starts, and that's when you would need to give more time to the core...

in the logs you can see that ESPEasy thinks for quite some time it's still connected, even if it's not (even a ping to the node fails), but the node still tries to send data (and obviously the connection fails then)...

the second issue, I agree completely, is: even if it gets a disconnect, why can't it reconnect with a completely new connection sequence... probably the module or core is stuck in some strange state?

EDIT: PS: the 4way HS does not always fail (I do it every 10 min.), so it seems to be a combination of the state / busyness of the ESPEasy and the time when the HS happens..

clumsy-stefan on 12 Jan 2019

What I can do, is try to completely drop WiFi (turn off modem also) and start a complete fresh reconnect when an unexpected disconnect is happening.
Not sure if it was in this thread, but in one of these issues, someone replied with a tweet of someone at Shelly, who mentioned they had to restart the wifi stack completely after disconnect.

TD-er on 12 Jan 2019

@TD-er that's also what I put in my latest version. within processDisconnect() I added a setWifiMode(WIFI_OFF); not sure though if that's enough... currently it's running on some test nodes (since about 1h)...

clumsy-stefan on 12 Jan 2019

@TD-er that's a cool idea. push a git release and I'll flash my stuff!

I just started reading some of the codebase 5 minutes ago, so this may be irrelevant, but: Also make sure that wifi_connect_attempt is set to 0 at that point.

0ki on 12 Jan 2019

hmm... adding some delay(0) in backgroundtasks() after each of the subroutine calls seems to further stabilize the first issue a bit (4way-HS). Not sure if it is a coincidence though...
@TD-er could it be that backgroundtasks() is blocking execution at some point in time?

clumsy-stefan on 13 Jan 2019

generally speaking: adding delay(0) in all kinds of different places where the timing stats show long processing times seems to improve the 4way-HS handling a lot. I have more SW/HW watchdogs kicking in now than rekeying issues (but still have some)... so, going by the symptoms, I would say it's really related to the core/wifi part not getting enough cycles to actually do (quite swift) things like rekeying, and the AP then getting tired of waiting for an answer...

What I also see is that sometimes the client.connect() or the sendData() can take quite a lot of time (up to 2 sec). In some documentation I found that if something takes more than 50 ms, you should call delay(0) in between.... but you can't split these calls as they are performed by the library...

There is also another callback, onStationModeAuthModeChanged, in the core. It would probably be worth looking at, as this one is probably triggered before the actual disconnect happens. But it's very sparsely documented... Anyone have experience with this?

EDIT: it seems to be a coincidence/race condition with some other things being done at the time when the AP requests a rekeying.... but again just observations...
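
To make the "call delay(0) when something runs longer than ~50 ms" idea concrete, here is a rough sketch in portable C++. It is not ESPEasy code: `runWithYields` and its parameters are hypothetical, and the yield callback stands in for what would be `delay(0)` on the ESP8266, so the snippet compiles off-device.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Runs each work item and hands control back (via yieldFn) whenever at
// least maxSliceMs has elapsed since the last yield. On the ESP8266 the
// callback would be something like [] { delay(0); } (assumption); here
// it is a plain callback. Returns the number of yields performed.
inline int runWithYields(const std::vector<std::function<void()>>& work,
                         const std::function<void()>& yieldFn,
                         long maxSliceMs = 50) {
  using Clock = std::chrono::steady_clock;
  int yields = 0;
  auto lastYield = Clock::now();
  for (const auto& item : work) {
    item();  // one plugin call, one controller send, etc.
    const auto elapsedMs =
        std::chrono::duration_cast<std::chrono::milliseconds>(
            Clock::now() - lastYield).count();
    if (elapsedMs >= maxSliceMs) {
      yieldFn();
      ++yields;
      lastYield = Clock::now();
    }
  }
  return yields;
}
```

Note that this can only yield between items. A single blocking library call such as client.connect() still runs to completion, which is exactly the limitation described above.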

clumsy-stefan on 15 Jan 2019

I don't know how often these requests sent by the AP are retransmitted.
Normally the beacon signal from the AP is sent every 102.4 msec. If some task takes more than that, the ESP misses at least one such beacon frame. So I'm not sure how bad this is.
I do know that ARP requests are missed sometimes, which leads to issues that a switch does not know anymore how to route packets to that MAC address and thus renders the ESP unreachable, while it can still send data itself.
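
For a rough feel of the numbers, a small worked example in C++. It assumes the common default beacon interval of 100 TU (1 TU = 1024 µs, so 102.4 ms); the function name is made up, and a real AP may be configured differently.

```cpp
#include <cstdint>

// 1 TU (time unit) = 1024 microseconds; the assumed beacon interval is
// the common default of 100 TU = 102.4 ms.
constexpr std::uint64_t kTimeUnitUs = 1024;
constexpr std::uint64_t kBeaconIntervalUs = 100 * kTimeUnitUs;

// Number of whole beacon intervals that fit into a blocking period.
constexpr std::uint64_t beaconIntervalsCovered(std::uint64_t blockedMs) {
  return (blockedMs * 1000) / kBeaconIntervalUs;
}
```

A 50 ms task covers no full interval, while a 2 second blocking call covers 19 of them (2,000,000 µs / 102,400 µs, rounded down).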

TD-er on 15 Jan 2019

missing beacon frames is nothing. it should not even need beacon frames as it should be actively probing for my hidden network anyway.

0ki on 15 Jan 2019

I am no expert in WiFi.
As I understood it, the beacon interval is the only -somewhat- guaranteed moment the ESP is trying to listen to the WiFi and inbetween it will try to sleep as much as possible.

And as I understood it, the only difference between 'hidden' networks and 'not hidden' networks is that the SSID is not being broadcast. So in fact it is connecting only based on BSSID (MAC-address of AP).
That's also what the ESP does when connecting to (not hidden) WiFi networks, with the only difference that it will perform a scan before and try to find a BSSID which has a SSID which matches the given one.
When performing an automatic reconnect, it will try the last known BSSID + channel combination first without performing a scan. In other words, it only differs when making a connection.

So to my knowledge, as long as there is a connection to a network, it will also listen to the beacon frames.

TD-er on 15 Jan 2019

I am no expert in WiFi.
As I understood it, the beacon interval is the only -somewhat- guaranteed moment the ESP is trying to listen to the WiFi and inbetween it will try to sleep as much as possible.

And as I understood it, the only difference between 'hidden' networks and 'not hidden' networks..........

Oh, that reminds me of something that may be quite important: I run my ESPs on a hidden SSID.
By the way still no disconnects or reboots on my nodes (trying 5 min group key updates)

jimmys01 on 15 Jan 2019

One of my nodes was rebooted on December 5th due to a power outage.

This is from that one:

As you can see, on average more than one disconnect a day. It is connected to an AVM Fritzbox 1750E

TD-er on 15 Jan 2019

I think the beacon frames are not so critical... I'm still not clear why after a disconnect the normal WiFi.begin() sequence always fails with a '(2) Auth expire'.

The only indication is, that CPU Load at this point always shows 100% and hints that the core just does not get enough time to do the handshake...

clumsy-stefan on 16 Jan 2019

My guess is that there's something wrong in the core libraries (LWIP or Arduino or ESP) and we should reset the wifi.
Thus turn off wifi and start all over again. Maybe also wait between those steps to make sure buffers are emptied or at least flushed and outstanding requests are actively being cancelled.
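
Roughly the order of operations I have in mind, as a sketch only. The `Radio` interface, its method names and the 100 ms settle time are all hypothetical; on the device they would wrap the core's WiFi calls. The recording fake is just there so the sequence can be checked without hardware.

```cpp
#include <string>
#include <vector>

// Hypothetical interface for the radio operations discussed above.
struct Radio {
  virtual ~Radio() = default;
  virtual void disconnect() = 0;
  virtual void powerOff() = 0;
  virtual void wait(unsigned ms) = 0;
  virtual void powerOn() = 0;
  virtual void connect() = 0;
};

// Full restart: drop the link, turn the radio off, pause so pending
// requests can drain, then start again from scratch.
// settleMs is an assumed value, not a measured one.
inline void restartWifi(Radio& radio, unsigned settleMs = 100) {
  radio.disconnect();
  radio.powerOff();
  radio.wait(settleMs);
  radio.powerOn();
  radio.wait(settleMs);
  radio.connect();
}

// Recording fake for checking the call order off-device.
struct RecordingRadio : Radio {
  std::vector<std::string> calls;
  void disconnect() override { calls.push_back("disconnect"); }
  void powerOff() override { calls.push_back("powerOff"); }
  void wait(unsigned) override { calls.push_back("wait"); }
  void powerOn() override { calls.push_back("powerOn"); }
  void connect() override { calls.push_back("connect"); }
};
```

Whether the waits are needed at all, and how long they should be, is something that would have to be found out by testing.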

TD-er on 16 Jan 2019

@TD-er can you push this change, so I can see if it works?

0ki on 16 Jan 2019

The change discussed in the last 2 posts?
It hasn't been implemented yet, but I can try this evening to code it and make a test build.

TD-er on 16 Jan 2019

Yep, resetting the wifi.

0ki on 16 Jan 2019

@TD-er wrote:

Thus turn off wifi and start all over again. Maybe also wait between those steps to make sure buffers are emptied or at least flushed and outstanding requests are actively being cancelled.

Not sure about flushing... since I removed all client.flush() calls before the client.stop(), it also seemed that the disconnect happens less often. While reading some documentation I saw that the flush empties all (TCP) input buffers, so theoretically it could happen that replies get lost...

clumsy-stefan on 16 Jan 2019

Just adding my 2c, I have a few Mikrotik APs, with a few sonoffs running espurna and a few NodeMCUs running ESPEasy. I had to add an extra AP for the ESP devices, because during a high traffic period in my network, like backups, I would lose my ESP devices. I think this was WiFi signal strength and traffic related. Once the extra AP was added, I don't remember seeing this problem again.

But my NodeMCUs and ESPEasy devices do have some disconnects. I am going to set up a netdata ping monitor for the devices and see if I can figure out when, how and why this might be happening, and whether I can give some info back.

LeeNX on 17 Jan 2019

Another note, I don't have 40+ devices like @clumsy-stefan , only about 5 users and their mobile devices and the odd smart device like TV, but could it be that either the Mikrotik device is overloaded or the WiFi RF network saturated?

I wish I had a way to see the amount of RF noise. I have an app (WiFi Analyzer) that shows me the number of APs and how they are spread over the WiFi channels. I used this tool to set my WiFi channels for the different WiFi APs.

But can't really tell if there might be other RF interference.

As for the idea of overloaded Mikrotik device, I am using MikroTik 951UI-2HND for my main router and MikroTik RB9412nD hAP lite as my ESP AP.

I wonder if that info is useful in any way?

LeeNX on 17 Jan 2019

@LeeNX thanks for the hints. my nodes are distributed over 3 APs and the RF is somewhat balanced (max. 20 clients per AP or so), so that there is as little interference as possible (of course there can't be none...). So I don't think the MTs are overloaded (I work in a company providing network backbone connectivity and WLAN for professional events, etc., so I have done quite some in-depth tests of the infrastructure), but I agree that overloaded airtime could be an issue... however, even so, a node not being able to reconnect in >250 attempts after a disconnect, and then reconnecting immediately after a reboot, can't be an airtime issue, I think... besides, I don't see that on any other device...
so I still think it's some coincidence or race condition between the state of the internal wifi chip, the amount of CPU time the wifi core gets and what other tasks are running on the ESPEasy part...

clumsy-stefan on 17 Jan 2019

@clumsy-stefan I concur with your insights. I just wanted to point out that the little Mikrotik AP with 40+ nodes could have been CPU or RAM limited, but that does not seem to be the problem.

I am wondering about deploying an ESP32 and a NodeMCU device with nothing other than the base ESPEasy core and letting them run, so that we could test the ESP CPU without any plugin influence. I'll see what I can deploy and ping-test with netdata.

If this gives us any useful info, I will report back.

Thanks Guys!!!

LeeNX on 17 Jan 2019

I do have a node here running next to nothing (IR RX/TX) and it is running for weeks without an issue.
The other node which has an uptime of over 40 days is running Domoticz MQTT and BME280 + MH-Z19 CO2.
So it will probably also depend on which plugin is running and how often and maybe also if the read-interval always matches the read-interval of other plugins.

I already have set some periodical function calls (10/sec, 1/sec and 30sec) with an initial offset in the scheduler so they keep running as much as possible in different loop calls.

Maybe I should also add some interrupt timer driven event like Tasker does (or simply use tasker instead of the scheduler I wrote) and set one task at 10 msec interval to run delay(0).
Only thing is, it may interrupt other GPIO transfers and thus disturb sensor readings.

TD-er on 17 Jan 2019

that would probably solve the 4way-HS issue, but I guess not the reconnect problem... that's still the most strange one... I just had my test node failing again for 12 hours trying to reconnect for >300times to the AP without success. after (soft-)rebooting it, it reconnected instantly (reboot+reconnect took some 5sec. or so)...

clumsy-stefan on 17 Jan 2019

I just found this issue which looks quite related to what we're seeing here:
https://github.com/esp8266/Arduino/issues/5527

TD-er on 17 Jan 2019

looks similar, yes... but turning wifi OFF does not seem to change it (tried that). currently I'm trying WiFi.setOutputPower(0) before calling WiFi.begin(), as suggested here
https://github.com/esp8266/Arduino/issues/2235
and here
https://github.com/esp8266/Arduino/issues/2186#issuecomment-233853152
still waiting for it to happen now though...

clumsy-stefan on 17 Jan 2019

it's similar but it's not the same. In fact, in our case rebooting the AP solves the issue; in the issue posted by Gijs, rebooting the AP does not solve the issue.

giig1967g on 17 Jan 2019

@giig1967g rebooting the AP does not solve it for me! only rebooting the node makes it reconnect again (in my environment though)...

clumsy-stefan on 17 Jan 2019

@giig1967g In the issue mentioned by @clumsy-stefan this reboot of the AP is also mentioned: https://github.com/esp8266/Arduino/issues/2235#issuecomment-248851270

So it seems we have several issues here at hand.

TD-er on 17 Jan 2019

@clumsy-stefan
Ah ok. And in my case only with Mikrotik... :(
@TD-er
maybe you should ask which AP brand/model they are using...

giig1967g on 17 Jan 2019

@giig1967g I also had some nodes here being unreachable, but the interval of those events was much more than 10 days, so it is a bit hard to say if it is related to one of these issues.
About half of my nodes now are connected to a Mikrotik and the other half is connected to Fritzbox 1750E

TD-er on 17 Jan 2019

@giig1967g I also only have MT's... and it seems to happen more often to the nodes with weaker signal and nodes with GPIO's in use (specifically PCF's)...

clumsy-stefan on 17 Jan 2019

I wish I had a way to see the amount of RF noise. I have an app (WiFi Analyzer) that shows me the number of APs and how they are spread over the WiFi channels. I used this tool to set my WiFi channels for the different WiFi APs.

But can't really tell if there might be other RF interference.

You can do /interface wireless registration-table print interval=1 on your mikrotik to see the snr.

0ki on 17 Jan 2019

unfortunately the WiFi.setOutputPower(0) did not help either... after nearly 6 hours of running smoothly (without any changes to the wifi infrastructure), it suddenly gets into the deadlock again (see serial output below), from which it never ever recovers, except if by coincidence a SW or HW WDT kicks in...

this only seems to happen after a "Group key update timeout" or a "4way handshake failed". if I kick the node off the AP manually it also gets an "auth expire" but can reconnect instantly...

serial output (note the load goes to 100% then)...

20457441 : WIFI : Disconnected! Reason: '(16) Group key update timeout' Connected for 2h19m
20457593 : WIFI : Connecting clumsy_ap2 attempt #0
20458607 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20459707 : EVENT: WiFi#Disconnected
20459763 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 2012 ms
20470762 : SYS  : 31.00,20048.00,66.70,341.00
20470766 : EVENT: sysinfo#rssi=31.00
20470824 : EVENT: sysinfo#mem=20048.00
20470882 : EVENT: sysinfo#load=66.70
20470940 : EVENT: sysinfo#uptime=341.00
20472658 : EVENT: Clock#Time=Thu,14:22
20473264 : WD   : Uptime 341 ConnectFailures 22 FreeMem 20040 WiFiStatus 0
20477609 : WIFI : Connecting clumsy_ap2 attempt #1
20478135 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20482577 : EVENT: WiFi#Disconnected
20482635 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 4771 ms
20503270 : WD   : Uptime 342 ConnectFailures 22 FreeMem 18440 WiFiStatus 0
20505709 : WIFI : Connecting clumsy_ap2 attempt #2
20506235 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20509784 : EVENT: WiFi#Disconnected
20509845 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 3879 ms
20530897 : SYS  : 31.00,16248.00,100.00,342.00
20530902 : EVENT: sysinfo#rssi=31.00
20530963 : EVENT: sysinfo#mem=16248.00
20531025 : EVENT: sysinfo#load=100.00
20531087 : EVENT: sysinfo#uptime=342.00
20532688 : EVENT: Clock#Time=Thu,14:23
20533158 : WD   : Uptime 342 ConnectFailures 22 FreeMem 16240 WiFiStatus 0
20533716 : WIFI : Connecting clumsy_ap2 attempt #3
20534242 : WIFI : Not configured in Station Mode!!: clumsy_ap2
20537788 : EVENT: WiFi#Disconnected
20537851 : WIFI : Disconnected! Reason: '(2) Auth expire' Connected for 3865 ms
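
For anyone who wants to count how often each reason shows up in a long serial capture, here is a small parsing sketch in plain C++. It only assumes the line format seen in the output above (`Reason: '(<code>) <text>'`); the function name is made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Extracts the numeric reason code from a log line such as
//   "... WIFI : Disconnected! Reason: '(16) Group key update timeout' ..."
// Returns -1 if the line has no "Reason: '(<digits>)" part.
inline int parseDisconnectReason(const std::string& line) {
  const std::string marker = "Reason: '(";
  const std::size_t start = line.find(marker);
  if (start == std::string::npos) return -1;
  const std::size_t digits = start + marker.size();
  const std::size_t end = line.find(')', digits);
  if (end == std::string::npos || end == digits) return -1;
  int code = 0;
  for (std::size_t i = digits; i < end; ++i) {
    if (line[i] < '0' || line[i] > '9') return -1;  // not a plain number
    code = code * 10 + (line[i] - '0');
  }
  return code;
}
```

In the capture above this yields 16 for the first disconnect and 2 for each of the failed reconnect attempts that follow.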

clumsy-stefan on 17 Jan 2019

just a quick remark: adding wifi_set_phy_mode(PHY_MODE_11G); and forcing the node into 802.11g mode seems to be at least a workaround. I have no reboots of any of the units since a couple of hours and also once or twice a unit recovered from the group key timeout issue above...
probably related to

• ESP8266 SoftAP only support 802.11b/g.

which is noted in the docs...

I'll update later/tomorrow if it stays stable...

clumsy-stefan on 19 Jan 2019

👍1

As already noted (a few times) by someone else, the sensitivity should also be increased by switching to 802.11G.
My objection to it is (was?) that most APs have issues when switching between N, G and B.
Do you have mixed clients (G and N) active on the same AP?

TD-er on 19 Jan 2019

I use different modes on all Mikrotik. B/G/N/AC mixed, also both bands 2/5GHz including virtual AP's defined... Different clients connect with different modes on the same AP, without issues (except the esp8266)..

clumsy-stefan on 19 Jan 2019

good news (at least for me)... my nodes ran for the last 12 hours without disconnects/reboots/freezes!

forcing the node into 802.11g seems to work smoothly together with the MikroTik AP's. at least this is a simple workaround (as long as you don't need the additional speed of 802.11n).

if someone wants to test it themselves, just add the following line in ESPEasyWifi.ino around line 644, in function tryConnectWiFi() (after setupStaticIPconfig();):
WiFi.setPhyMode(WIFI_PHY_MODE_11G);

I can only guess that the issue arises because the SoftAP mode does not support 802.11n, so when the node is connected in N mode the core gets "confused" somehow... but I guess this is an issue to raise with the core people...

For ESPEasy this could be made configurable I guess (@TD-er ?), so that the mode the ESP should run in (B/G/N) could be chosen on the config page...

clumsy-stefan on 20 Jan 2019

👍1

Yep I will add a selection option in the advanced settings.
I forgot who asked it first, but will look it up and credit him too in the commit ;)

TD-er on 20 Jan 2019

perfect!! I don't need the credit I just need it to work ;) so feel free to dedicate it to him!! 😄

While debugging I found some other small things which I would suggest changing (like calling delay() unconditionally after plugin calls, some additional {} and some additions to the logging output). Should I do a PR for these ASAP, or would you rather keep it as it is (I think these are not vital changes, just some minor improvements probably)?

clumsy-stefan on 20 Jan 2019

Please make a PR.
It would at least help the discussion and keep track of your findings.
Also adding comments to what you have checked would be helpful

TD-er on 20 Jan 2019

For reference #2012
All credits to the dev. team!

Domosapiens on 20 Jan 2019

Does this patch also include the WiFi.setOutputPower(0) idea?

I've been running an experiment for 7 days now.
mega-20180224 node - zero reboots
mega-20190110 node - 35 reboots; 9 of them today.

0ki on 20 Jan 2019

I left the relevant line in the code but commented it out. So you would need to remove the // (about line 644) in ESPEasyWiFi.ino and recompile it!

clumsy-stefan on 20 Jan 2019

@0ki sorry, I just saw that I misread your question... No, this is not included, as I could see no difference when setting power to 0 before reconnecting... So I discarded that one again!

clumsy-stefan on 20 Jan 2019

I've been running an experiment for 7 days now.
mega-20180224 node - zero reboots
mega-20190110 node - 35 reboots; 9 of them today.

That's effectively comparing:

core 2.3.0 without event based wifi
core 2.4.2 with event based wifi.

TD-er on 20 Jan 2019

good news (at least for me)... my nodes ran for the last 12 hours without disconnects/reboots/freezes!

forcing the node into 802.11g seems to work smoothly together with the MikroTik AP's. at least this is a simple workaround (as long as you don't need the additional speed of 802.11n).

I did it the other way around for testing.
I set my Mikrotik to N-Mode only and well... all ESPs are running since >12h without a single reboot.

chunter1 on 21 Jan 2019

good and valid approach too! Unfortunately I can't do this, as I do have a number of clients without N capability...

clumsy-stefan on 21 Jan 2019

Just disable B mode and enable G/N only. You are better off without the B mode. Unless you have old phones or (typically) wifi barcode scanners from 10 years back that only support B mode and rates.

The added sensitivity of the antenna in B mode is also not going to play any significant role in range. B mode just eats up airtime. There is literally no drawback in not using B mode and no drawback in using N mode over G either. N mode even has better range than B and G.

jimmys01 on 21 Jan 2019

@jimmys01 you're right, B mode is probably not used anywhere anymore... but I can't go for only N though... therefore it would be up to test what happens in G/N mode if the nodes stay stable and are able to reconnect in case of a 4way HS timeout..

clumsy-stefan on 21 Jan 2019

g/n is unstable for me.

0ki on 21 Jan 2019

Does this mean that the esp8266 runs into trouble each time it transitions between a mode (B/G/N)?

chunter1 on 21 Jan 2019

Does this mean that the esp8266 runs into trouble each time it transitions between a mode (B/G/N)?

Very likely.
I've seen a lot of issues with some devices (not ESP ones) only supporting G mode.
If you use them mixed with N devices, they can appear to be unreachable.
Notorious are the HomeWizard and some WiFi cams.

TD-er on 21 Jan 2019

With this new version mega-20190121 the ESP seems to be more stable; I have not had the crash/WLAN problems so far.
But now I can't switch off the OLED (framed plugin) anymore. When I send OLEDFRAMEDCMD,off it switches the display off for a very short moment and then turns it on again.
In a back-to-back test with an old version (from 2018) the display reacts as expected.

kischde on 23 Jan 2019

Just tried out mega-20190121 and WiFi stability is the same. It crashes within about an hour from starting.

@TD-er, when do you think you could push a version with the change discussed so I can test it out?

0ki on 23 Jan 2019

quick update: since fixing the mode to 802.11g I have no freezes, reboots, etc. anymore! even though the APs operate in mixed g/n mode... so I'd also vote for a patch that makes the wifi mode configurable on the esp!

clumsy-stefan on 23 Jan 2019

Hope for a patch soon too 😉

chunter1 on 23 Jan 2019

I just added a commit for selecting the B/G only.
It is not yet working well on ESP32, so I disabled the option to select it for ESP32.
I also added a fall-back option to be able to connect again if the AP is not allowing B/G only.
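
The fall-back idea, sketched as plain C++ so it can be read without the rest of the code. This is an illustration only, not the code from the commit: the enum, the function name and the "3 attempts" default are all hypothetical.

```cpp
// Which PHY modes the node is willing to use for the next attempt.
enum class PhyMode { BG, BGN };

// Hypothetical selection logic: honour the "Force WiFi B/G" setting, but
// allow N again after a number of failed attempts, so a node does not
// lock itself out when the AP refuses B/G-only clients.
inline PhyMode choosePhyMode(bool forceBG, unsigned failedAttempts,
                             unsigned fallbackAfter = 3) {
  if (!forceBG) return PhyMode::BGN;
  return failedAttempts >= fallbackAfter ? PhyMode::BGN : PhyMode::BG;
}
```

So with the option enabled, the first attempts use B/G only, and only repeated failures widen the choice again.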

TD-er on 24 Jan 2019

@TD-er I mean this change:

My guess is that there's something wrong in the core libraries (LWIP or Arduino or ESP) and we should reset the wifi.
Thus turn off wifi and start all over again. Maybe also wait between those steps to make sure buffers are emptied or at least flushed and outstanding requests are actively being cancelled.

The change discussed in the last 2 posts?
It hasn't been implemented yet, but I can try this evening to code it and make a test build.

0ki on 24 Jan 2019

@0ki I tried both variants, setting output power to 0 and actively disconnecting/reconnecting when the issue occurs. neither version helped to reset/reconnect to the AP... (in my environment)... more details above...
the only thing which made it stable was not to use N-Mode.

clumsy-stefan on 24 Jan 2019

I'm well aware of that. Hoping for a patch soon, so I can continue my debugging.

0ki on 24 Jan 2019

@0ki I can make a testbuild for you using this latest commit.
Or do you also want to test with the disabling of WiFi?

And if so, what options do you want/need? WiFi off/on and restart all services?
That last part (restart services) is also one thing to have a good look at, since I am not entirely sure all services using a socket are properly restarted.
That may also cause some issues I guess when the WiFi connection is recreated.

TD-er on 24 Jan 2019

Automatically fully turning off (and then on) the wifi transmitter when a disconnect is detected.

0ki on 24 Jan 2019

I see that you've added some code. Which version can I pull to test this out?

0ki on 1 Feb 2019

Not sure, I will make a new build for testing.
It will be done in +/- 30 minutes.

TD-er on 1 Feb 2019

test build
It is based on what I merged earlier today and PR #2235 which has the changes for connecting in B/G mode.
There seem to be issues with some plugins in that PR, so that's the reason it hasn't been merged yet.

TD-er on 1 Feb 2019

Thanks! Flashed the test build.

I have the following settings now:
Connection Failure Threshold: 0 Force WiFi B/G: NO Restart WiFi on lost conn.: YES

Do you think I should test something differently considering I have PLATFORMIO_ESP12E?
That is, I forgot whether b/g is going to work on esp12e and what the connection failure threshold was exactly.

0ki on 2 Feb 2019

That threshold is just a counter limit: a reboot is initiated when the number of unsuccessful connect attempts exceeds that value.
0 is the default and means it will not perform a reboot.

I guess setting it to 50 or 100 will help you to reboot it when it may be at a hard to reach place and it gets into some reconnect loop.

Even with this setting, it may still show HW watchdog reboots, as a number of my nodes already showed.
It may just be harder to reproduce those with the B/G mode active.
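
The threshold logic boils down to something like the sketch below. It is a simplified model, not the actual ESPEasy code: the class and method names are hypothetical, and resetting the counter on a successful connect is an assumption made for the example.

```cpp
// Tracks failed connection attempts and reports when the configured
// threshold is exceeded. A threshold of 0 disables the reboot.
class ConnectFailureMonitor {
 public:
  explicit ConnectFailureMonitor(unsigned threshold) : threshold_(threshold) {}

  void onConnectFailed() { ++failures_; }

  // Assumption for this sketch: a successful connect clears the counter.
  void onConnected() { failures_ = 0; }

  bool shouldReboot() const {
    return threshold_ != 0 && failures_ > threshold_;
  }

 private:
  unsigned threshold_;
  unsigned failures_ = 0;
};
```

With a threshold of 50, the 51st failed attempt in a row is the one that triggers the reboot; with 0 it never does.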

TD-er on 2 Feb 2019

test build
It is based on what I merged earlier today and PR #2235 which has the changes for connecting in B/G mode.
There seem to be issues with some plugins in that PR, so that's the reason it hasn't been merged yet.

Thanks for the test build.
My previously most unstable ESP12F is now up since 59 hours without a reboot :)

EDIT: Settings are:
Force WiFi B/G: YES
Restart WiFi on lost conn.: YES

chunter1 on 4 Feb 2019

I think I will split that commit (B/G WiFi) to a separate PR, so it can be merged before I fix the rest of the PR in which it is now waiting to be merged.

TD-er on 4 Feb 2019

👍3

Good idea!

My previous settings RestartWifi=YES did nothing useful.

Now I switched to

Connection Failure Threshold: 0
Force WiFi B/G: YES
Restart WiFi on lost conn.: NO

and it's been up two days without reboot or even a single disconnect. Amazing. I don't think I'll even be moving away from this test release.

0ki on 4 Feb 2019

After approx. 100 hours the module finally did a reset. :(

Reset Reason: Hardware Watchdog

However this may also be caused by the challenging config (12 x 1-wire sensors and a FHEM server).

chunter1 on 6 Feb 2019

You could also try to add this patch: https://github.com/letscontrolit/ESPEasy/pull/2235/commits/9a05eaf828a737ae416ab11c8df8c4a5b03ceaf2
Maybe that will also help improve stability.
It is included in this test build

TD-er on 6 Feb 2019

You could also try to add this patch: 9a05eaf
Maybe that will also help improve stability.
It is included in this test build

Thanks!
Installed it on two completely different configured units and will give feedback.

chunter1 on 6 Feb 2019

I had a run on one of my test nodes including
https://github.com/letscontrolit/ESPEasy/commit/9a05eaf828a737ae416ab11c8df8c4a5b03ceaf. Unfortunately it took only about 2 hours until it ran into the group-key-timeout issue again, without recovering or ever reconnecting to the AP. This was with "restart wifi on connection lost" enabled and without forcing B/G, so N still seems to be an issue.
So for me B/G mode is still the only stable working option.

clumsy-stefan on 7 Feb 2019

Could we look at dropping N support altogether? Do we need the benefits provided by N for such a simple device?

0ki on 8 Feb 2019

@0ki with the "new" option to force B/G mode that @TD-er built in, it is basically dropping N at software level, you can't remove it from the core libs I assume...

clumsy-stefan on 8 Feb 2019

When moving this into main settings I recommend this option is shipped ON by default.

0ki on 8 Feb 2019

👍1

We could make it dynamic.
If no connection is possible then try again with the N option on.

There are lots of setups which have N-only checked in the AP. So in those environments you cannot go back if you're disabling N support on the ESPeasy.

TD-er on 8 Feb 2019

👎2

You could also try to add this patch: 9a05eaf
Maybe that will also help improve stability.
It is included in this test build

Thanks!
Installed it on two completely different configured units and will give feedback.

So far, one module restarted after 2 days, the other after 5 days.
Both showed "Hardware Watchdog" as reset reason.
Both modules are set to "Force WiFi B/G: Yes" and "Restart WiFi on lost conn.: Yes".
The AP is a Mikrotik set to B/G only (uptime 49 days).
Distance between AP and modules approx. 2...3 m.

chunter1 on 11 Feb 2019

I was wondering why my suggestion to make it dynamic (to go back to N from B/G only setting) has received 2 thumbs-down "votes".
Please elaborate why it is not a good suggestion.

TD-er on 11 Feb 2019

I was wondering why my suggestion to make it dynamic (to go back to N from B/G only setting) has received 2 thumbs-down "votes".
Please elaborate why it is not a good suggestion.

In my opinion, if somebody switches on this option he was already desperately searching for more stability and knows what this means for his AP settings.
I prefer a static solution because when I select the B/G-only option, I absolutely never want to be in a situation where I think it is using B/G but it actually isn't.
However, since my nodes are rebooting anyway (although much less frequently), I would love to have a real solution for the problem related to the newer core.

chunter1 on 11 Feb 2019

I get that, but you also have to understand the issue when users don't know what a setting does and end up with unreachable nodes.
So it should have some big threshold before switching back to N support.
For example the fall-back for corrupted plugins is:
After 10 unsuccessful reboots it will disable 1 plugin or controller to see if the reboot may be successful.

Something similar can also be applied to the wifi connection.
After X unsuccessful reconnect attempts it will allow connecting with 802.11N.
The reception quality is better for G mode networks, so if you put X to something like 10, it is very unlikely it will connect with N instead of G. Especially if you reset this counter every time you attempt it in N mode.

TD-er on 11 Feb 2019

Hmm the last few weeks of less time for ESPeasy apparently obfuscated my memory of what I planned to implement or what has been implemented.

I just looked at the code for this and found:

void setConnectionSpeed() {
  #ifdef ESP8266
  if (!Settings.ForceWiFi_bg_mode() || wifi_connect_attempt > 10) {
    WiFi.setPhyMode(WIFI_PHY_MODE_11N);
  } else {
    WiFi.setPhyMode(WIFI_PHY_MODE_11G);
  }
  #endif
}

So apparently it has already been implemented to only allow N when wifi_connect_attempt is > 10.
This counter is reset to 0 as soon as there is a successful attempt to connect to wifi.

Is this an acceptable solution?
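
To make the boundary of that condition explicit, here is a small standalone sketch of the same decision. It only mirrors the `if` from the snippet above; `PhyMode` and `selectPhyMode` are hypothetical names used so it compiles without the ESP8266 core.

```cpp
#include <cassert>

// Hypothetical stand-ins for WIFI_PHY_MODE_11G / WIFI_PHY_MODE_11N.
enum class PhyMode { G, N };

// Same condition as in setConnectionSpeed(): N is only allowed when
// B/G is not forced, or after more than 10 connect attempts.
constexpr PhyMode selectPhyMode(bool forceBG, unsigned connectAttempt) {
  return (!forceBG || connectAttempt > 10) ? PhyMode::N : PhyMode::G;
}

// Usage example: with "Force WiFi B/G" set, attempts 0..10 stay on G
// and attempt 11 is the first one that may use N.
void example() {
  assert(selectPhyMode(true, 0) == PhyMode::G);
  assert(selectPhyMode(true, 10) == PhyMode::G);
  assert(selectPhyMode(true, 11) == PhyMode::N);
  assert(selectPhyMode(false, 0) == PhyMode::N);
}
```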

TD-er on 11 Feb 2019

Hmm the last few weeks of less time for ESPeasy apparently obfuscated my memory of what I planned to implement or what has been implemented.

I just looked at the code for this and found:
void setConnectionSpeed() {
  #ifdef ESP8266
  if (!Settings.ForceWiFi_bg_mode() || wifi_connect_attempt > 10) {
    WiFi.setPhyMode(WIFI_PHY_MODE_11N);
  } else {
    WiFi.setPhyMode(WIFI_PHY_MODE_11G);
  }
  #endif
}
So apparently it has already been implemented to only allow N when wifi_connect_attempt is > 10.
This counter is reset to 0 as soon as there is a successful attempt to connect to wifi.

Is this an acceptable solution?

Well.. i would say yes (hoping the bug in the 2.5.0 will be found soon).

chunter1 on 11 Feb 2019

Feedback on setting B/G only.
I have just installed the test FW on one unit that rebooted at least every 20 minutes. Before, it was connected in N mode at -72 dBm. That should have been a good enough signal to not disconnect. I am operating enterprise-grade APs with -105 dBm receive sensitivity, so again it should never drop the channel.

After a few hours with the test FW there has been no reset so far.
I will keep it running and report back.

PS: I agree that it is related to N somehow.
PPS: I am running 20 x ESP-07 nodes that have been upgraded to 4M. Time consuming, but it pays off when re-flashing is needed.

NorbertRoller on 11 Feb 2019

I have 120+ nodes all running on GN (mega-20190202 ) with no issues whatsoever. They are all reporting to be connected to N

jimmys01 on 11 Feb 2019

👍1

@jimmys01 How many nodes are connecting to the same AP?
And what's the brand of that AP? (Mikrotik?)

TD-er on 11 Feb 2019

Almost all of them connect to a different AP (one per hotel room) They have a switch (PIR sensor) an HTU temp and humidity sensor, and IR leds.
The AP brand is Mikrotik.
Settings are as follows:
Country is set
Control Channel: 20MHz
Band: GN
Extension Channel: Disabled
Authentication Type: WPA2 PSK only
Encryption: aes ccm only
Group Encryption: aes ccm
Group Key Update: 00:01:00

jimmys01 on 11 Feb 2019

Feedback on setting B/G only.
I have just installed the test FW on one unit that rebooted at least every 20 minutes. Before, it was connected in N mode at -72 dBm. That should have been a good enough signal to not disconnect. I am operating enterprise-grade APs with -105 dBm receive sensitivity, so again it should never drop the channel.

After a few hours with the test FW there has been no reset so far.
I will keep it running and report back.

PS: I agree that it is related to N somehow.
PPS: I am running 20 x ESP-07 nodes that have been upgraded to 4M. Time consuming, but it pays off when re-flashing is needed.

Unfortunately the unit continues to reset after a random time, anywhere within 1-4 hours.
The B/G-only FW doesn't solve my issues.

NorbertRoller on 12 Feb 2019

I was wondering why my suggestion to make it dynamic (to go back to N from B/G only setting) has received 2 thumbs-down "votes".
Please elaborate why it is not a good suggestion.

Let's say the wifi itself goes down for an hour due to external/environmental circumstances. My node(s) switch over to N and become highly unstable again (my application is sensitive to reboots - wifi doesn't have to be there 100% of the time, but node itself should be operational all the time).

I'd want a configuration for my nodes that avoid instability and never ever connecting to N does that for me right now.

0ki on 12 Feb 2019

When moving this into main settings I recommend this option is shipped ON by default.

As a solution we could have this shipped into fallback mode (B/G-then-N) by default with being switchable to B/G only or B/G/N.

As for the actual solution for 2.5.0, please try this:

#include "esp_wifi.h"
esp_wifi_set_ps(WIFI_PS_NONE);

0ki on 12 Feb 2019

I have 120+ nodes all running on GN (mega-20190202 ) with no issues whatsoever. They are all reporting to be connected to N

How do you update 120+ nodes within an acceptable timeframe??
"..no issues whatsoever..." did you check the uptime of each node and what is it?
Are all the same since last intended reboot?

chunter1 on 12 Feb 2019

Uptimes are counted in days and all of them have 0 reconnects. For months now I have never had any issues with connectivity, except when I disabled PMKID on the APs. Probably the newer core got pickier about the wifi settings, or I don't know what. I also run all of those with 2A power supplies. We plan to deploy 80 more ESP8266 nodes in the next month. By the way @TD-er, if you want access to have a look at this particular setup, I am happy to give it to you. I really want ESPEasy to flourish; it has helped us a lot.

Edit: @chunter1 I update them using batch scripts

curl -# -o /dev/null --form update=@ESP_Easy.bin --max-time 40 --connect-timeout 20 --retry 1 http://x.x.x.x/update

jimmys01 on 12 Feb 2019

@jimmys01 I recently bought a Mikrotik myself, to be able to test stuff.

And just a notice about core 2.5.0 builds.
In the upcoming nightly builds, the 2.5.0 builds are not included, since there is a bug in the core 2.5.0 which makes the webserver stop responding and let the node crash.
As soon as that's fixed, there will be core 2.5.0 builds again.
In the last nightly build, there is a core 2.6.0 alpha included, which is essentially the latest code from the github of the core library so you can see for yourself.

TD-er on 12 Feb 2019

I have had 2.6.0 with B/G only running for 4 days on two modules, and the first one did a reboot.
The reset reason was the usual "hardware watchdog".
What I have observed since the B/G-only mode became active is a runtime of 3..5 days before a reset occurs.

chunter1 on 19 Feb 2019

Maybe also compare the number of wifi disconnects, since those disconnects seem to be related to the crashes.

TD-er on 19 Feb 2019

Both modules are in the cellar, only 2..3m away from the Mikrotik AP.
I just checked the second module running the same FW-version and it shows 0 reconnects and 4d02h uptime (no reboot since flashing).
I strongly believe that the other module, which did the reset, also had 0 reconnects before it reset.

Regarding the tasks configured on the two test modules:
The module which did reset has 12 DS18B20 sensors configured that send their values every 120s to a FHEM server.
The module which has not reset so far has only two GPIOs configured, which also send their values every 120s to the same FHEM server.

So, the AP (Mikrotik) is the same, the distance (2..3 m) is the same, the FHEM server is the same. Only the module with the 12 temperature sensors definitely resets more frequently than the one with the 2 GPIOs.

chunter1 on 19 Feb 2019

On my nodes this happens mainly when FHEM is too slow to respond or not reachable, because I defined a maximum retry of 100 and then the node reboots (which then also shows up as a manual reboot).

clumsy-stefan on 19 Feb 2019

On my nodes this happens mainly when FHEM is too slow to respond or not reachable, because I defined a maximum retry of 100 and then the node reboots (which then also shows up as a manual reboot).

Why should your nodes reboot when the retries are over 100?
(I have set retries to 0 for testing)

chunter1 on 19 Feb 2019

tools -> advanced -> connection failure threshold

to make sure that if the wifi on the modules chokes for some reason it reboots after some time...

clumsy-stefan on 19 Feb 2019

Ahh with "retries" you mean the "Connection Failure Threshold:" not the "Max retries:" in the controller settings of FHEM.
I have set the connection failure threshold to 0.

chunter1 on 19 Feb 2019

👍1

BTW: for the first time I had a node which got stuck in the 4WHS timeout issue even though "only B/G" was active... never had that before until now (self compiled, 2.5.0 core).... I'll keep watching this, if this was just a coincidence...

clumsy-stefan on 19 Feb 2019

Please note that there will be a core 2.5.1 very soon which does revert the used SDK to SDK2.2.1, since there are lots of issues with SDK3.0.0
This may also change how nodes react on wifi and maybe it will be comparable to core 2.4.2 again.
Not sure what fixes in core 2.5.0 are included which may fix some other issues we're facing.

@chunter1 About the nodes you have.
It seems like the one that rebooted was sending a lot more messages to the controller, so statistically speaking it then has a higher chance of trying to send something which gets stalled somewhere in the process.

TD-er on 19 Feb 2019

@clumsy-stefan Can you test the current code on Github? (or the latest build)

I have made some changes in how to perform network related activity.

TD-er on 27 Feb 2019

Do you mean this one?

ESP_Easy_mega-20190227_test_core_260_sdk3_alpha_ESP8266_4M.bin

chunter1 on 27 Feb 2019

Just about any build of last night.
I also added a "normal" version for core 2.6.0 using SDK2.
Core 2.6.0 SDK2 is now running on several nodes and have only had a HW watchdog on one of them which is exceptionally low on resources (running lots of high memory plugins) and also the one with the probably worn-out flash memory (should throw that one away since it has had 1000's of firmware updates)

TD-er on 27 Feb 2019

@TD-er I compiled and installed from git about 6 hours ago on 2 test nodes, but as I still have very limited internet access I can probably only report back in 24 hours or so...

clumsy-stefan on 27 Feb 2019

👍1

I have been testing the latest firmware with "Force B/G mode" with Mikrotik router and so far it seems to work stable. Will report back.

@TD-er One question: What is the rule for the fallback to N-mode when "Force B/G mode" is set?

giig1967g on 8 Mar 2019

@giig1967g
If it failed to connect for over 10 times to the AP with _B/G only_, it will revert to the default BGN mode.

So this leaves the possibility it will connect in N mode if the AP was offline for some time.
If this appears to be a real issue, I can try to set it to only do that every 10th attempt, but then I have to make sure it will also try that when the fallback ssid is tried.
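
A minimal sketch of the "only every 10th attempt" variant mentioned above, purely as an illustration of the idea. `PhyMode` and `selectPhyModeEveryTenth` are hypothetical names, the attempt counter is assumed to start at 1 for the first attempt, and the interaction with the fallback SSID is left out.

```cpp
#include <cassert>

// Hypothetical stand-ins for the SDK PHY modes.
enum class PhyMode { BG, BGN };

// With "Force B/G" set, only every 10th connect attempt may use BGN;
// all other attempts stay on B/G only.
constexpr PhyMode selectPhyModeEveryTenth(bool forceBG, unsigned attempt) {
  return (!forceBG || (attempt != 0 && attempt % 10 == 0)) ? PhyMode::BGN
                                                            : PhyMode::BG;
}

// Usage example: attempts 1..9 use B/G, attempt 10 may use BGN,
// attempt 11 goes back to B/G.
void example() {
  assert(selectPhyModeEveryTenth(true, 1) == PhyMode::BG);
  assert(selectPhyModeEveryTenth(true, 9) == PhyMode::BG);
  assert(selectPhyModeEveryTenth(true, 10) == PhyMode::BGN);
  assert(selectPhyModeEveryTenth(true, 11) == PhyMode::BG);
  assert(selectPhyModeEveryTenth(false, 1) == PhyMode::BGN);
}
```

Compared with switching to BGN permanently after 10 failures, this keeps most attempts on B/G even when the AP has been offline for a long time.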

TD-er on 8 Mar 2019

Hi @TD-er
I think that the fallback is a bad idea.
Look at this scenario: I have my unit forced to B/G mode running happily with my Mikrotik.
For some reason my Mikrotik goes offline (update, reboot, whatever).
Whenever the Mikrotik comes up again, the ESP will then connect in N-mode and I lose the stability of the unit.

In other words, if I set Force B/G mode and connection fails then it should become an AP (192.168.4.1), but shouldn't fallback to N-mode. The fallback possibility creates a false expectation to the user.

Don't take me wrong, the fallback was good for testing the validity of the change, but now that it has proven to solve the issue (at least mine so far) the fallback should be removed.

I agree with @chunter1 comment: https://github.com/letscontrolit/ESPEasy/issues/1987#issuecomment-462295204

giig1967g on 8 Mar 2019

What do you think about the next scenario then.
As soon as it is possible to connect to the given AP's using B/G only, an extra flag will be set to provide no fallback anymore.
If some of these settings (B/G only, or other AP settings) change, the fall-back will be enabled again until it was successful to connect to the given AP.

Edit:
With fallback I mean those extra settings, not the "fallback SSID"

TD-er on 8 Mar 2019

In other words, the fallback remains active only until the first successful connection to the AP, then it is removed. In this case it would be helpful to see in the syspage the wifi mode.
It's a possible solution even if I still prefer to avoid fallbacks to N if I set Force B/G.
I understand the possibility of the user to make mistakes, but then the user could type the SSID wrong and not get access to the unit in anycase...

Another question: in the current implementation, in which scenario will the unit become an AP?
Because it seems to me that the unit will try B/G for 10 times then try N but will it eventually give up and become an AP?

giig1967g on 8 Mar 2019

Nope, the AP-mode remains active while still testing to connect to the given APs.
Maybe we should also add an optional check for uptime and only allow to start the AP mode in the first 10 minutes the node is booted.
Just for extra security and also to exclude the possibility the AP mode may have an effect on the ESP not being able to reach the given wifi networks.
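
The optional uptime check could look roughly like this. It is a sketch under assumptions: `mayStartApMode` and its `restrictToBootWindow` flag are hypothetical names, and the uptime is assumed to come from something like `millis()`.

```cpp
#include <cassert>

// 10 minutes expressed in milliseconds.
constexpr unsigned long kApModeBootWindowMs = 10UL * 60UL * 1000UL;

// Returns true when starting AP mode is allowed. With the optional
// restriction enabled, AP mode may only start in the first 10 minutes
// after boot.
constexpr bool mayStartApMode(unsigned long uptimeMs, bool restrictToBootWindow) {
  return !restrictToBootWindow || uptimeMs < kApModeBootWindowMs;
}

// Usage example.
void example() {
  assert(mayStartApMode(5UL * 60UL * 1000UL, true));    // 5 min after boot
  assert(!mayStartApMode(10UL * 60UL * 1000UL, true));  // window has closed
  assert(mayStartApMode(60UL * 60UL * 1000UL, false));  // check disabled
}
```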

And we should keep focus on the "easy" part of the project.
This means proper defaults and no overwhelming amount of settings offered, but give the option to the expert to do it all.
This also means there should be a proper fall-back for the less experienced user.
Especially for B/G only settings. I guess 90+ percent of the people starting with ESPeasy are not aware of the differences between 802.11B/G/N So if they experience issues, which can be handled very well by using a fallback, it may cause them to look for other projects.
I also understand why this fallback should not give a false sense of 'stability', so I really understand why the current implementation has room for improvement. But if the "first connect attempt success => disable fallback" is made automatic, then it is also perfect for the more experienced user. (who also makes stupid mistakes, as I know from experience ;) )

I agree it should be made more clear what connection setting is actively used.

TD-er on 8 Mar 2019

understood.
So, the proposal is:
The unit boots.
The N-fallback flag is set to true.
If FORCE B/G is set, it tries 10 times to connect to the wifi AP in B/G.
If it can connect, it sets the N-fallback flag to false.
If it can't connect, it tries in N-mode.
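
The proposal above can be written down as a small state machine. This is only a sketch to make the discussion concrete: `ForceBgWithFallback`, `PhyMode` and the member names are hypothetical, and persisting the flag in the settings is not shown.

```cpp
#include <cassert>

// Hypothetical stand-ins for the SDK PHY modes.
enum class PhyMode { BG, BGN };

class ForceBgWithFallback {
public:
  explicit ForceBgWithFallback(bool forceBG) : forceBG_(forceBG) {}

  // Mode to use for the next connect attempt.
  PhyMode nextMode() const {
    if (!forceBG_) return PhyMode::BGN;
    const bool fallBack = nFallbackAllowed_ && failedAttempts_ >= 10;
    return fallBack ? PhyMode::BGN : PhyMode::BG;
  }

  // Report the outcome of a connect attempt made with mode `used`.
  void onConnectResult(PhyMode used, bool connected) {
    if (!connected) {
      ++failedAttempts_;
      return;
    }
    failedAttempts_ = 0;
    // The first successful B/G connect disables the N fallback for good.
    if (forceBG_ && used == PhyMode::BG) nFallbackAllowed_ = false;
  }

  bool nFallbackAllowed() const { return nFallbackAllowed_; }

private:
  bool forceBG_;
  bool nFallbackAllowed_ = true;  // set to true on boot, per the proposal
  unsigned failedAttempts_ = 0;
};

// Usage example covering both cases discussed in the thread.
void example() {
  // AP never reachable in B/G: after 10 failures the node tries N.
  ForceBgWithFallback fresh(true);
  for (int i = 0; i < 10; ++i) fresh.onConnectResult(fresh.nextMode(), false);
  assert(fresh.nextMode() == PhyMode::BGN);

  // B/G proven once: later outages never fall back to N.
  ForceBgWithFallback proven(true);
  proven.onConnectResult(PhyMode::BG, true);
  for (int i = 0; i < 100; ++i) proven.onConnectResult(proven.nextMode(), false);
  assert(proven.nextMode() == PhyMode::BG);
}
```

If the flag were reset to true on every boot, the first case in the example would be exactly the power-failure scenario: the router comes up after the ESP has already used its 10 B/G attempts.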

Please consider also this scenario with Force B/G set: power failure.
ESP and Wifi router are powered off.
Then power returns, and ESP is up quicker than the wifi router.
In this case it could happen that after 10 times the wifi AP is not still listening.
So the unit will try N-mode and eventually will succeed.
(Experienced) user will not know that the connection was in N-mode and thinks that it's in B/G mode with lack of stability.

I don't want to insist, but really these HW watchdog and freeze issues have been going on for a year. Now that you, with the help of the community, have found one working solution, I strongly advise making sure it remains applied. The risk of an inexperienced user setting "Force B/G mode" is lower than that of setting the wrong SSID, and the result is the same: no access to the access point.

SUGGESTION:
In order to make sure the less experienced user does not make a mistake, why don't you add a button to "Test B/G mode"? If it succeeds, it enables "FORCE B/G mode"; if not, it remains disabled.

In this case Less experienced users will know what they are doing and more experienced users will be sure that when ForceB/G mode is set it cannot fallback to N.

What do you think?

You could

giig1967g on 8 Mar 2019

Not only if it boots, but the fallback option will then be disabled in the settings and saved.

TD-er on 8 Mar 2019

OK. Then a manual check is even more appropriate than an automatic check, don't you think?

giig1967g on 8 Mar 2019

Yep, I will first add a simple checkmark to disable overrides.
But later it should be made more dynamic in this to also help us, the "experts" to make less mistakes :)

TD-er on 8 Mar 2019

ok

giig1967g on 8 Mar 2019

Version 2019_02_16 had been running for 9 days without a reboot (forced to B/G). Yesterday it started to reboot again. The hardware watchdog forced this two times.
After a while the unit was not reachable any more. Switching off the WLAN router did not help (tried it 3 times, for up to 30 minutes). Restarting the ESP by switching off the power fuse did not help either.
I had to shut down the WLAN again and then connect to the ESP directly via 192.168.4.1 to get access to it.

I have absolutely no idea what happened

kischde on 10 Mar 2019

@kischde What kind of accesspoint are you using?
Last night I experienced something similar myself while experimenting with installing a new AP.
I was trying to install a MikroTik AP and all was working fine until the ESP needed to reconnect.
Through the web interface of the MikroTik, I could see the node was connected to the WiFi, but it didn't receive an IP address.
Even when I tried to connect it to my phone's hotspot, I could reboot the ESP but it repeated this behavior.
Only when I set the main AP config to another AP it started to connect like it should.

Note that the node was not rebooted to achieve this.

One thing that is set "incorrect" on that AP, compared to another MikroTik I have, is that it has the "Distance" setting set to "indoors".
I will perform more tests to see what's the difference here, but like discussed before, it seems to be some timeout setting for a reply on a packet.
And I can imagine the ESP does take some time to respond to DHCP requests, which may be just too short for this setting.

TD-er on 10 Mar 2019

I am using a Swisscom AP (Swiss telecom provider AP), but I had all the same issues described here in various places by the people with the MikroTik. So maybe it has the same chipset, but I can't change much in the settings, for example those extended settings.
Before re-powering I also tried to connect it to my mobile phone, like you did, with the same behaviour as you saw.
I use a fixed IP, no DHCP.
I have now switched to the current 2019_03_05; will see what happens...

kischde on 10 Mar 2019

Did you recently change WiFi settings?
For example, an access point with MAC address AA:AA:AA:AA:AA:AA on position 1 of the SSIDs, replaced by another AP with MAC address BB:BB:BB:BB:BB:BB.
I have got the feeling there are some settings left in some place in the ESP where we do not store them.

I will also try to see in the NonOS docs how these can be effectively cleared when we change WiFi settings.
Also in my tests, it appears to be really useful to have more than 2 SSID's to be used.
I will also try to add more field for this, or maybe even allow to store some encrypted file in SPIFFS to have a near unlimited amount of WiFi AP's stored.

TD-er on 10 Mar 2019

Did you recently change WiFi settings?

No, as I have only one AP, this was IMO not necessary

kischde on 10 Mar 2019

OK. Good to know.
Since I don't know yet what is causing this, I'm trying to eliminate as much unknowns as possible.
So the only thing you needed to do to make it work again is add the setting for wifi again.

TD-er on 10 Mar 2019

No, even less.
I forced a direct connection to the ESP via 192.168.4.1 (disabled the AP).
Then I checked everything, but did not find anything abnormal... So I gave it a try and restarted my AP, then reset the ESP, and it ran again. So IMO I just had to force it to do a "normal/other" WLAN connect.
BTW I also saw it connected at the AP, but with no IP address assigned, even though it's static.

kischde on 10 Mar 2019

Hmm, I've been playing with it a bit, with 5 nodes connected to the MikroTik I'm playing with.

As soon as I set "Hw. Fragmentation Threshold" (just 'unfold' the option), the ESP nodes are no longer capable of receiving an IP address.
The default value of this setting is 256. If I set it to 1600 (will fit a full MTU), all nodes will receive an IP address and continue to work.
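
As a rough back-of-the-envelope check of why 256 versus 1600 might matter, the sketch below counts fragments under a deliberately simplified model. Everything in it is an assumption for illustration: the `fragmentsNeeded` helper is hypothetical, the model ignores the 802.11 header and FCS carried by each fragment, and the 328-byte figure assumes a 300-byte minimum BOOTP/DHCP message plus 20 bytes of IP and 8 bytes of UDP header. Whether fragment reassembly is what actually fails on the ESP is not established here.

```cpp
#include <cassert>

// Simplified model (assumption): a frame is split into
// ceil(frameBytes / threshold) fragments.
unsigned fragmentsNeeded(unsigned frameBytes, unsigned threshold) {
  assert(threshold > 0);
  return (frameBytes + threshold - 1) / threshold;
}

// Usage example.
void example() {
  const unsigned dhcpPacket = 328;  // assumed size, see the text above
  assert(fragmentsNeeded(dhcpPacket, 256) == 2);   // split at threshold 256
  assert(fragmentsNeeded(dhcpPacket, 1600) == 1);  // sent whole at 1600
  assert(fragmentsNeeded(1500, 256) == 6);         // a full-MTU packet
}
```

Under these assumptions even a minimal DHCP reply would be fragmented at the 256 setting, while small packets would still pass unfragmented.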

This is shown in the MikroTik UI when the nodes are able to communicate: (screenshot)

And this when they are not able to send/receive any data (but are connected to the WiFi layer): (screenshot)

TD-er on 10 Mar 2019

@TD-er did you see that there is an option in the new esp8266 core called LWIP_FEATURES which I think will activate IP reassembly... Probably that's not defined in your build?

see here: https://github.com/esp8266/Arduino/blob/192aaa42919dc65e5532ea4b60b002c4e19ce0ec/tools/sdk/lwip2/include/lwipopts.h#L748-L754

also support for IP fragmentation is set with this:
https://github.com/esp8266/Arduino/blob/192aaa42919dc65e5532ea4b60b002c4e19ce0ec/tools/sdk/lwip2/include/lwipopts.h#L756-L763

clumsy-stefan on 10 Mar 2019

Hmm, not sure what the defaults are.
Are the numbers given on the first line in the Doxygen documentation the defaults?

TD-er on 10 Mar 2019

I don't know... I just know that in the Arduino IDE, in the platform's definition file, you can select whether you want it included and defined or not.

clumsy-stefan on 10 Mar 2019

I always thought the LWIP parts were included as pre-compiled libraries in the core distribution.
So then it is quite hard to make sure LWIP is rebuilt using the correct flags.

TD-er on 11 Mar 2019

yes, but it depends which one you link against (from boards.txt):

-llwip2-1460-feat
-llwip2-536-feat
-llwip2-536
-llwip2-1460
-llwip2

clumsy-stefan on 11 Mar 2019

So they're using these flags:

https://github.com/esp8266/Arduino/blob/192aaa42919dc65e5532ea4b60b002c4e19ce0ec/boards.txt#L357-L387

|Label | build.lwip_lib | build.lwip_flags |
|---|---|---|
|v2 Lower Memory | -llwip2-536-feat | -DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=1 -DLWIP_IPV6=0|
|v2 Higher Bandwidth | -llwip2-1460-feat | -DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=1 -DLWIP_IPV6=0|
|v2 Lower Memory (no features) | -llwip2-536 | -DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=0 -DLWIP_IPV6=0|
|v2 Higher Bandwidth (no features) | -llwip2-1460 | -DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=0 -DLWIP_IPV6=0|
|v2 IPv6 Lower Memory | -llwip6-536-feat | -DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=1 -DLWIP_IPV6=1|
|v2 IPv6 Higher Bandwidth | -llwip6-1460-feat | -DLWIP_OPEN_SRC -DTCP_MSS=1460 -DLWIP_FEATURES=1 -DLWIP_IPV6=1|
|v1.4 Higher Bandwidth | -llwip_gcc | -DLWIP_OPEN_SRC|
|v1.4 Compile from source | -llwip_src | -DLWIP_OPEN_SRC|

So all v2 versions have LWIP_FEATURES=1 except for the ones labelled as "no features"
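
A quick way to check a given set of build flags against the table above is to look for the define directly. This is a small standalone sketch; the `hasLwipFeatures` helper is hypothetical and the flag strings are copied from the table.

```cpp
#include <cassert>
#include <string>

// Returns true when the given build.lwip_flags string enables LWIP_FEATURES.
bool hasLwipFeatures(const std::string& lwipFlags) {
  return lwipFlags.find("-DLWIP_FEATURES=1") != std::string::npos;
}

// Usage example with rows taken from the table.
void example() {
  // "v2 Lower Memory" links -llwip2-536-feat
  assert(hasLwipFeatures(
      "-DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=1 -DLWIP_IPV6=0"));
  // "v2 Lower Memory (no features)" links -llwip2-536
  assert(!hasLwipFeatures(
      "-DLWIP_OPEN_SRC -DTCP_MSS=536 -DLWIP_FEATURES=0 -DLWIP_IPV6=0"));
  // "v1.4 Higher Bandwidth" does not set the define at all
  assert(!hasLwipFeatures("-DLWIP_OPEN_SRC"));
}
```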

TD-er on 11 Mar 2019

Yes, but if you look at the -l statements, you need to use the libraries with -feat at the end (contrary to the label, which is a bit counter-intuitive)...

clumsy-stefan on 11 Mar 2019
