ESPEasy: MQTT stops working since 20181008

Created on 18 Oct 2018  ·  76 Comments  ·  Source: letscontrolit/ESPEasy

BUILDS! ---> 20181017

Summary of the problem/feature request

Since build 20181008 I have the problem that MQTT "hangs" regularly. Then no more values are transferred. For example, I see that the LWT status of Connected is no longer there in the IOBroker. If I take build 20180927, everything is stable again immediately.

Expected behavior


Stable connection of MQTT

Actual behavior


ESP loses connection

Steps to reproduce

Use builds newer than 20180927 (I have used 20181008 through 20181011, and 20181017).
The screenshot is with build 20180927. With newer builds the "Connected" status disappears and, for example, Spannung (voltage) is no longer updated.

Yes, after some time (not always the same amount of time), IOBroker loses the connection to the client.

Still exists

System configuration

Hardware:
ESP Wroom2

ESP Easy version: Release mega-20181017
[Screenshots: mqtt, device, mqttconf]

Rules or log data

only one rule:
on MQTT#Connected do
Publish %sysname%/status/IP,%ip%
endon

Controller Stability Fixed Bug

All 76 comments

In 20181004 the following has changed:

  • [sendHttp] #1830 Set timeout and early exit on timeout reached
  • [WiFiClient] Set timeout and make it configurable for controllers
  • [Core 2.4.1] Move back to core 2.4.1 from 2.4.2

And in 20181006:

  • [WiFi] Add delay to connection attempts in ControllerSettings

Could you perhaps test the 20181003 build and maybe these other 2 also to narrow down the problem?

I will try, but I'm afraid I won't be able to do it until Saturday. ;-)

Do you have any idea on how long it takes before the MQTT connection is lost?
And does it somehow coincide with a wifi reconnect?

As far as I could see, there was no WiFi loss.
The MQTT connection was lost quite quickly this morning (after 10 min), but according to the last entry from IOBroker it has also taken hours at other times...
I will try to start at least one of the 3 builds tonight.

Just as a check; Do you build it yourself, or use nightly builds?

another thing to check:

Are you sure MQTT actually stops working?
Mine here shows "Connection LOST" after a while, but everything is still working with MQTT.

The only bug I see is the message "Connection Lost" while everything is still working.
For example, I get the uptime in minutes via the MQTT controller.
After a while (between 10 minutes and 10 hours) I see "Connection Lost", but the minutes sent over MQTT still count up.

Greetz
Sascha

MQTT should do a reconnect as soon as it has detected it was no longer connected.
See also @Sasch600xt other related issue:
https://github.com/letscontrolit/ESPEasy/issues/1873#issuecomment-431170001

A few builds before 20180927, the one you report to be working just fine, I added a fix for MQTT controllers in which the MQTT client on the ESP gives itself a new client ID. This prevents the broker from refusing a new connection when the broker still assumes an existing connection from that client while the client thinks it is disconnected.

Can you increase the timeout setting of the controller? Before I introduced that setting the timeout was 1000 msec, and the first build with the setting had 100 msec as default. Later I changed the default (for new settings) to 300 msec.
So maybe you can try to set it to 300 or higher (no higher than 1000 msec) just as a test. (no matter what version, at least one that used to fail)

I did experience the same issue with 1 (out of 5) units. I have played a bit with the MQTT controller settings (Minimum Send Interval, Max Queue depth, Max retries & Full Queue Action); settings which work now for me with this unit: 1000ms, 5 / 5 and "Delete Oldest".

Ok news...
Looks like it is the same issue that blb4github describes in the comment above. I took a NEW unit and it has worked without problems so far!!!! Looks like the "old" one has a memory problem or timing problems. I will try to increase the timeout on this one... I'll keep you informed ;-)
Btw.. (Sorry.. it's a self-build with Atom, because I don't want all plugins... but until now that has always gone well..) You all are doing great work on this... and as far as I can tell it is only this unit (and therefore the one with the newest SW) that has this problem. I will increase the settings and let you know.

Regards Peter

@fraeggle is it possible for you to paste here screenshot of good and bad modules systeminfo pages? (esp and storage sections)

And some description of the hardware.
For example I have the feeling there have been some changes recently in sonoff modules.
Sonoff basic and the S20

@fraeggle Could you perhaps also perform a full wipe of the problematic node and start clean?
Empty bin files are included in the release ZIP files.

Hi TD-er
I had erased the bad node before... Since I changed the timeout setting to 800ms
it is stable. Voltage is being updated every 120 sec.

Hardware is nothing special.... because I'm still waiting for the INA219 to arrive.. ;-)

[Screenshots: esp_good, esp_bad, mqtt_neu, devices]
Regards Peter

Oops, I accidentally closed it..

But I think with this "workaround" it can be closed... I really think it is a unit problem now..
What do you think about it, TD-er?

Still it is strange the units differ in this aspect.
It would suggest the wifi stability is worse in one unit

i agree but I think that this unit simply has too large a tolerance range. As i told you, since i changed to 800 ms it works fine...
so if you agree i will close this one, due to the fact that there is the ability to change the timeout.

Thanks for your help..

regards Peter

I found the issue "since 20181007 MQTT Open Hab shows "Connection Lost"" (#1873).
Maybe it is the same issue?
TD-er, tell me if I can help by creating logs or something like that...
I'm using OpenHAB (with IOBroker), but so far there has been no error after setting the timeout to 800ms.

Is the MQTT broker for OpenHAB you are running, installed on a computer in your own network, or external?
And if local, is it perhaps installed on a computer which lacks resources?

I also have a test that has been running for 394 minutes on 20181020.
Timeout at 800ms.
Sometimes it shows up as Disconnected after 24 hours or even longer. But I have never reached 2 days.
And again, for me everything is still working after it shows "Connection Lost". So it's just the message that is confusing for me. I'll keep you informed.

Connection Lost is only an information message indicating... the connection was lost.
But it should restart a new connection.
In the sysinfo page you can see how often the WiFi connection was lost and rebuilt and also how long it is connected to WiFi since the last (re)connect to the accesspoint.

As soon as the WiFi connection is interrupted, it will also assume the MQTT connection has to be rebuilt, so it will consider the connection to the MQTT broker to be lost.
Until 4 weeks ago, it was possible the MQTT broker would not accept the new connection attempt, since the broker still considered the client to be connected.
This could result in indefinite reconnect attempts, when the client kept trying to connect and the broker does not accept the connection.
I changed the MQTT client ID on every reconnect (adding the number of reconnects to the client ID) so the broker would consider it a new client.
This results in a speedy reconnect. The only drawback is that the broker will send the LWT (last will and testament), since it assumes the client to be disconnected.
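The trade-off described above can be illustrated with a toy model in Python. This is only a sketch, not the ESPEasy or broker implementation: the `ToyBroker` class, the helper name `reconnect_client_id`, the `_<n>` suffix format and the topic `node/status/LWT` are all assumptions made for illustration. The broker here models only the behaviour reported in this thread (refusing a connect for an ID it still believes to be connected).

```python
def reconnect_client_id(base_id, reconnect_count):
    """Hypothetical helper: append the reconnect count to the client ID."""
    if reconnect_count == 0:
        return base_id
    return f"{base_id}_{reconnect_count}"


class ToyBroker:
    """Toy broker that tracks sessions by client ID (illustration only)."""

    def __init__(self):
        self.sessions = {}   # client_id -> (will_topic, will_payload)
        self.retained = {}   # topic -> last published payload

    def connect(self, client_id, will_topic, will_payload):
        # Modelled behaviour from the thread: refuse an ID that is
        # still considered connected.
        if client_id in self.sessions:
            return False
        self.sessions[client_id] = (will_topic, will_payload)
        return True

    def publish(self, topic, payload):
        self.retained[topic] = payload

    def drop(self, client_id):
        # The broker notices the old session is dead and publishes its will.
        topic, payload = self.sessions.pop(client_id)
        self.publish(topic, payload)


def demo():
    broker = ToyBroker()
    topic = "node/status/LWT"   # assumed topic name
    base = "node"               # assumed client ID

    broker.connect(base, topic, "Connection Lost")
    broker.publish(topic, "Connected")

    # WiFi drops; the broker has not noticed yet.
    same_id_ok = broker.connect(reconnect_client_id(base, 0), topic, "Connection Lost")
    new_id_ok = broker.connect(reconnect_client_id(base, 1), topic, "Connection Lost")
    broker.publish(topic, "Connected")

    # Later the broker cleans up the stale session and sends its LWT.
    broker.drop(base)
    return same_id_ok, new_id_ok, broker.retained[topic]
```

In this model the reconnect with the old ID is refused and the one with the new ID succeeds, but the stale session's will is published last, so the status topic ends at "Connection Lost" even though the node is connected.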

If that results in undesired behavior, I can change it into a new strategy.
For example on a reconnect attempt which fails, I can try to send a proper disconnect message first.

In short, there are numerous ways in which the connection to the broker can get lost, and I'm not sure why it is lost in your case.
Could be a timeout, WiFi disconnect, some unknown response of the broker or any other reason.

ok.....got it.
Very good statement, thank you.
I am at minute 507 right now / Open HAB controller / 800ms timeout.
So far all is good :)

Should I set the default timeout for new instances of a controller back to 1000 msec?

Well......gimme 36 more hours.....then I'll know whether the bug also occurs with 800ms or not.

Minute 629 now without problems...

Weird... I discovered the following after the change to 800ms.
Last Disconnect Reason | (200) Beacon timeout
Number reconnects | 3
What does Beacon timeout mean?
MQTT is still running, but right now it is the same as for Sasch600xt. It is different than before, though... MQTT is still working,
only the broker says this one is not connected.
[Screenshot: mqtt]

For more information on the Beacon frame, see Wikipedia
In short, the access point sends periodically a packet containing information about the network.
This interval is typically 100 TU (102.4 msec).
The ESP module is trying to listen to this beacon every time, but for a number of reasons it may miss a beacon frame. The timeout is longer than 100 TU, so it must miss a number of these beacon frames to report a beacon timeout.
Reasons to miss such a beacon frame are:

  • too busy processing other blocking tasks (very likely)
  • access point not sending a beacon due to high traffic demands of others (depending on brand/model/settings)
  • RF disturbances (not likely given how often this occurs)
  • clock drift (not really likely)

So the "beacon timeout" is happening every now and then on the ESP nodes.
I am working hard to get every plugin/controller to use short time slices to make tasks as little blocking as possible.
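The beacon arithmetic above can be checked with a few lines of Python. This is a minimal sketch: one 802.11 time unit (TU) is 1024 microseconds, and the function names are made up for illustration. The actual beacon timeout threshold used by the ESP is not stated in this thread, so none is assumed here.

```python
TU_MICROSECONDS = 1024  # one 802.11 time unit


def tu_to_ms(tu):
    """Convert 802.11 time units to milliseconds."""
    return tu * TU_MICROSECONDS / 1000.0


def min_beacons_in_window(blocked_ms, beacon_interval_tu=100):
    """Lower bound on the number of beacons sent while the node is blocked.

    Any window of this length contains at least this many beacon instants,
    regardless of where the window starts.
    """
    return int(blocked_ms // tu_to_ms(beacon_interval_tu))
```

For example, `tu_to_ms(100)` gives the 102.4 msec mentioned above, and a node that is blocked for 350 msec cannot have listened to at least 3 beacons.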

i can confirm all fraeggle said.
Sometime tonight I got "Connection Lost"

Have a nice weekend :)
Sascha

@fraeggle:
If it shows up as "Connection Lost", try going into the controller settings of ESPEasy: simply disable the controller -> save, enable it again -> save. Then, at least for me, it shows up as connected again. Not a solution, but at least interesting :) Is this IPSymcon you are using?

@TD-er:
Too busy tasks? Well......on the "TEST UNIT" I have running here, I only send the uptime in minutes every 60 seconds......nothing more is running on this unit.....so probably the smallest possible system

@Sasch600xt the term "too busy" is maybe not the best one to describe the real issue.
The Arduino way of doing things is:

  • call setup()
  • call loop() over and over again.

On top of that, the Arduino core for ESP8266 (and ESP32 Arduino) runs some tasks outside the loop() for the Arduino part.
These tasks are about background processes, like handling network connection and incoming traffic, etc.
The background tasks are executed only at:

  • end of a loop()
  • when calling delay() or yield() from within Arduino space.

If a loop run takes more than 102.4 msec and no calls to yield() or delay() are made, the ESP node will have missed a beacon interval.
Also if it is running several blocking tasks which are always busy right at the moment the WiFi access point is sending the beacon, a number of beacons will be missed.
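The scheduling rule described above (background tasks only run at the end of a loop() pass or at a yield()/delay() call) can be modelled in a few lines. This is a simulation sketch in Python, not Arduino code; the function name and the task durations are invented for illustration.

```python
BEACON_INTERVAL_MS = 102.4


def longest_unserviced_gap(steps):
    """Longest stretch (msec) in one loop() pass without background servicing.

    `steps` is a sequence of (duration_ms, yields_after) pairs. Background
    tasks are assumed to run after a step that yields and at the end of the
    pass.
    """
    longest = 0.0
    current = 0.0
    for duration_ms, yields_after in steps:
        current += duration_ms
        if yields_after:
            longest = max(longest, current)
            current = 0.0
    # The end of loop() always gives the background tasks a turn.
    return max(longest, current)


# Three tasks run back to back without yielding: 150 msec unserviced.
blocking_pass = [(40, False), (80, False), (30, False)]
# The same tasks with a yield after the first two: at most 80 msec.
cooperative_pass = [(40, True), (80, True), (30, False)]
```

With these made-up numbers the blocking pass exceeds one beacon interval (150 > 102.4 msec) while the cooperative pass stays below it, which is the reason for splitting work into short time slices.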

When you look at the serial log (with Debug level enabled) you will see the timing statistics.
Some of them may take several tens of msec and thus are candidates for causing these 'disconnect' reasons.

I could add a task to the scheduler to try to listen to the beacon every 102.4 msec. Only thing is, I am not sure how to see when such a beacon has been seen.

About this issue, I could look into the disconnect/reconnect of MQTT when a connection has been lost.
What broker are you using? I am using Mosquitto here and it is working fine with the current behavior.

OK, "too busy" was a bit simplistic of me :)

I am using a script running in my IPSymcon house control environment as the MQTT broker

Greetz
Sascha

Hi Sascha.. Using IOBroker.
@TD-er went back to Build 27092018. Broker Tells me still Connected...
Really Confusing.. NO errors within 14 Hours ( Number reconnects | 0)

I am installing IObroker now on my computer, just to see what's going on.

Edit:
45 minutes later and I am still not able to get IObroker to run on my computers.
Not sure what's going on here, but the Windows installer just doesn't work (the service bat file is nowhere to be found) and installation on Linux just doesn't finish. It keeps trying to do the same npm install over and over again.
Tested on Ubuntu 18.04 on an Intel CPU host and Bash on Windows (Ubuntu 18.04)

Not sure about Windows.. I think there are some software dependencies. I'm running it on a Raspberry.
On Windows, ioBroker uses Node.js as its platform and requires it (download: http://nodejs.org/download/), so Node.js has to be installed first.

I also have an issue with one module with 4 DS18B20 sensors connected.
I thought it was my RasPi, but I took a clean Raspbian Stretch image and installed mosquitto and node-red onto it. Same issue, the connection breaks after 6-12 hours.

image

image

Dashboard: https://emoncms.org/Edegem/scrtmp2e
The 4 curves from the ESP are CV_aanvoer, CV_retour and Sanitair_warm, Sanitair_koud

@fluppie do you use official builds?

No, I build myself with PlatformIO/Atom
EDIT: Ha, I didn't read well, I'll try an official build.

image

Let's watch this!

Hi

I have/had the same problem. I am using HomeAssistant, but with the ESPEasy_mega-20181111 firmware the problem looks fixed to me.

Thx
T

Mine was still losing the connection after 2-3 days. I'll update to: ESP_Easy_mega-20181112_normal_ESP8266_4096.bin

Let's see!

Mine was losing the connection after 1-10 hours. I now have ~10h30min uptime and all 5 related modules are connected. :-)

seems it would be worth to try... still on 2709 because need a stable connection. Pls keep me informed. :-))

You can also follow via: https://emoncms.org/Edegem/scrtmp2e and check if the CV_ and Sanitair_ graphs are there :).

1 day uptime and still ok. :-)

:-D Sanitair_warm nearly 60 C ? little Campfire? ;-)
I will try the firm.. Thanks...

I was showering ;-)

1 day... still connected :-D
using 12112018

After ~3 days 3 hours one of my units lost the MQTT connection. :-( ( mega-20181112 4096 VCC fw )

@redskinhu:
just to make sure, is it still working well after the unit lost MQTT connection ?
because on my side, it is only showing up as "connection lost" but still working fine with all MQTT actions.

Greetz
Sascha

Indeed, a lost connection is not uncommon every now and then.
As long as the connection is being rebuilt without human intervention, there is no problem.
WiFi connections will be reset every now and then and there is nothing you can do to prevent it.
As soon as such a connection is lost, it notifies MQTT clients on the same node to reconnect.

It looks like some kind of LWT-related problem. MQTT is not totally disconnected: ESPEasy sends and receives MQTT messages, but the LWT is not renewed, so HomeAssistant doesn't get the LWT "Connected" message and therefore shows the relevant sensor/switch as unavailable. This is my theory...

Mine Still connected and unlike my first post even MQTT LWT tells me connected. Looks good.

Indeed, a lost connection is not uncommon every now and then.
As long as the connection is being rebuilt without human intervention, there is no problem.
WiFi connections will be reset every now and then and there is nothing you can do to prevent it.
As soon as such a connection is lost, it notifies MQTT clients on the same node to reconnect.

OK is there a possibility to renew LWT connected in this case?

It is supposed to send the LWT at the moment it renews the connection.

I have a node right now that is shown as disconnected in the LWT, yet it sends measurements just fine. It's a slightly older build... maybe this tells you something

GIT version: | mega-20181008
Uptime: | 3 days 17 hours 36 minutes

I have seen strange things when the ESP node thinks it was disconnected and needs to reconnect, but the broker disagrees.

Hi

I made a small investigation. It looks just the LWT is the problem. One of my ESP produced this MQTT "disconnected" thing again.

Everything is working fine, sensors / switches. But the Home Assistant couldn't see them because in the config I provided the LWT details.

- platform: mqtt
  name: "Socket 02"
  command_topic: "home/Socket02_ESP12F/GPIO/13"
  state_topic: "home/Socket02_ESP12F/Relay/State"
  availability_topic: "home/Socket02_ESP12F/status/LWT"
  payload_available: "Connected"
  payload_not_available: "Connection Lost"
  payload_on: "1"
  payload_off: "0"
  qos: 1

I checked the MQTT traffic: I got the sensor data and I could turn the GPIO on/off. Everything is OK except the LWT.

image

After I published an LWT "Connected" message, everything went back to normal.

image

I hope it helps.

T

P.S.:
I am thinking about a rule that publishes the LWT "Connected" message if the LWT message is "Disconnected", but sadly I can't import MQTT strings ;-)

Anyway I have been eagerly looking forward to the new fw release. 4 long days....

I've been away on Monday/Tuesday and the days after were really busy ;)

I will have a look at the LWT to see if I can find why it isn't published on reconnect.

What is the timeout setting of the MQTT broker?
I am thinking about the possibility the broker assumes a lost connection, but the ESP node still acts like it has been active and still is connected.

I tried 100-1000 ms. Now 300. Doesn't matter.

Not in the controller, in the broker.
Those timings are in the order of 10 - 15 seconds.
So please check in the config of the broker what timeout is being used.

I didn't change anything about the LWT timeout and I couldn't find any relevant settings in my Mosquitto config. It must be the default. I couldn't find any LWT timeout setting in the docs either.

No not the LWT timeout, the timeout of the broker.
If a client doesn't send a message in the timeout period, the broker will consider it disconnected.
So if the client uses another timeout setting, it could be the broker considers the client disconnected and the client does not try to reconnect.

As I understand the MQTT specification, the broker sends the LWT when it has not received a message from the client within the Keep Alive period, which is set by the client. There is nothing to configure on the broker side.

See https://www.hivemq.com/blog/mqtt-essentials-part-10-alive-client-take-over

And

https://www.hivemq.com/blog/mqtt-essentials-part-9-last-will-and-testament
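To make the keep-alive rule concrete: the MQTT 3.1.1 specification says the client sets the Keep Alive value in its CONNECT packet, and the server must treat the connection as lost if it receives no control packet within one and a half times that period. The sketch below only encodes this rule; the function names and the 10 second example value are assumptions for illustration, not settings taken from ESPEasy.

```python
def broker_deadline_s(keep_alive_s):
    """Seconds of silence after which the broker drops the client.

    MQTT 3.1.1: 1.5 times the Keep Alive value; 0 disables the mechanism.
    """
    if keep_alive_s == 0:
        return None
    return 1.5 * keep_alive_s


def broker_considers_lost(seconds_since_last_packet, keep_alive_s):
    """True if the broker would give up on the client and publish its will."""
    deadline = broker_deadline_s(keep_alive_s)
    return deadline is not None and seconds_since_last_packet > deadline
```

With an example Keep Alive of 10 seconds the deadline is 15 seconds, so a client that stays silent for 16 seconds would have its LWT published while 12 seconds of silence is still tolerated.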

That's true, but what I want to know is what this "Keep Alive Period" is on the broker side.
I know it is fixed at the ESPeasy side.
But if these are not in sync with each other, you will see strange things happening.

Hm, I don't have any way to set something like this.
[Screenshot: mqtt]
But it still says "Connected". 4 days :-D

Just had one mega-20181006 do the same: LWT not updated to Online, but MQTT working normally.

@jimmys01 Can you see in your broker if it is basing the decision on the client ID? (this changes when connection is lost)
This change of client ID is something I added in the late September builds.

@TD-er, I found a way to reproduce this! I de-auth the wemos from the mikrotik router and it connects right back up, but the LWT remains offline while the mqtt is still working. Send me builds to test at will

Can you also see the client ID in the MQTT broker?
If it has a "-1" or something behind the client ID, it means it accepts the new client ID.
If not, then it may have something to do with the change of client IDs I made a while ago.

I have not found where the client ID is in Mosquitto, but the logs look as if no client is disconnected and a new one is connected, probably because the WiFi de-auth and auth process is faster than the heartbeat of the MQTT protocol.

New connection after I restart both the esp and the broker

 New connection from 10.10.1.53 on port 1883.
1542965214: New client connected from 10.10.1.53 as aquariums_1 (c1, k10, u'openhabian').

De-auth the client and the client reconnects. The broker this time left the client LWT as online for some reason...

1542965214: New client connected from 10.10.1.53 as aquariums_1 (c1, k10, u'openhabian').
1542965308: New connection from 10.10.1.53 on port 1883.

Second de-auth

1542965308: New client connected from 10.10.1.53 as aquariums_1_1 (c1, k10, u'openhabian').
1542965308: Socket error on client aquariums_1, disconnecting.
1542965385: New connection from 10.10.1.53 on port 1883.

From this reconnect on, the LWT remains offline, while MQTT messages work fine. Notice the extra _1_1 on the client name.

1542965385: New client connected from 10.10.1.53 as aquariums_1 (c1, k10, u'openhabian').
1542965385: Socket error on client aquariums_1_1, disconnecting.
1542965448: Socket error on client aquariums_1, disconnecting.
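The order of events in these log lines is easier to see when parsed. The following Python sketch extracts connect and disconnect events from the lines quoted above. It assumes the usual Mosquitto log layout, in which `c1` is the clean-session flag and `k10` the keep-alive in seconds requested by the client; the regular expressions and function name are written for this example only.

```python
import re

CONNECT_RE = re.compile(
    r"^(\d+): New client connected from (\S+) as (\S+) "
    r"\(c(\d), k(\d+)(?:, u'([^']*)')?\)\.$"
)
ERROR_RE = re.compile(r"^(\d+): Socket error on client (\S+), disconnecting\.$")

LOG = """\
1542965308: New client connected from 10.10.1.53 as aquariums_1_1 (c1, k10, u'openhabian').
1542965308: Socket error on client aquariums_1, disconnecting.
1542965385: New connection from 10.10.1.53 on port 1883.
1542965385: New client connected from 10.10.1.53 as aquariums_1 (c1, k10, u'openhabian').
1542965385: Socket error on client aquariums_1_1, disconnecting.
1542965448: Socket error on client aquariums_1, disconnecting.
"""


def parse_events(lines):
    """Return (timestamp, kind, client_id, keep_alive_s) tuples.

    Lines that are neither a client connect nor a socket error are skipped.
    """
    events = []
    for line in lines:
        line = line.strip()
        match = CONNECT_RE.match(line)
        if match:
            events.append((int(match.group(1)), "connect",
                           match.group(3), int(match.group(5))))
            continue
        match = ERROR_RE.match(line)
        if match:
            events.append((int(match.group(1)), "disconnect",
                           match.group(2), None))
    return events


EVENTS = parse_events(LOG.splitlines())
```

In the parsed list each "disconnect" of the old client ID is logged after the "connect" of the new one, and the keep-alive reported for this client is 10 seconds.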

OK, then I will revert the client ID change.
Maybe it is also possible to check remote if the client is considered to be connected?
I have seen connection refusals when the broker considers the client still connected and the client tries to reconnect.
Retrying in short intervals will keep this status and thus a client never is able to reconnect.

Any status update on this?

Nope, not yet.
But since we're making some progress (finally) on other stability issues, I guess it will be next on my list.
Thanks for the ping ;)

I made it optional to change client ID at reconnect. (Tools => Advanced, next to the other MQTT settings)

Please close the issue if it is working now.
I will set it to fixed.

Why change the client ID anyway?

There were reports of the broker rejecting connection attempts as long as the broker assumes the client was still connected.
So the client got rejected and tried again.
But somehow those reconnects triggered something on the broker side to consider connection attempts as recent activity of that client and thus it would never consider the client to be disconnected.

OK, this solved it, but the solution is not self-explanatory for others who see that check box.
Maybe add a pop up that will explain to tick it if there are reconnection issues after a loss of wifi
A search at tasmota issues will show you that they had similar looking issues, related to the MQTT retain, looked like this was a broker issue.
I also had one client not reconnecting to wifi at all after I de-authed it... need to investigate that.

We will add this to the documentation.
We are working very hard to make the documentation up-to-date and also move a lot of documentation from the wiki to the ReadTheDocs. This makes it possible to have documentation per version.
The docs files are also included in the GitHub repository.

Ideally a commit for a fix in the code will also contain an update in the documentation.
I will now close this issue.
If you have some more information on the other issue you mentioned, please open a new issue for that.

Hi

I installed the 20190110 release recently. It looks promising. No LWT problems after 5 days of testing. :-)
I have a couple of nodes with the "MQTT change ClientId at reconnect" option enabled and some with it disabled.

Good news!

T
