Portainer: Endpoint Instability

Created on 10 Dec 2018  ·  140 Comments  ·  Source: portainer/portainer

IMPORTANT UPDATE (29/07/19):

Since this issue was opened, there have been many reports with the same or similar reproduction steps. One thing is clear: there is no single root cause.

Through extensive testing across multiple operating systems, browsers and deployment scenarios, we have confirmed the bugs below, each of which can lead to endpoint instability.

_Confirmed bugs (This list will be updated as others are confirmed):_

  • ~#2624 Portainer not resetting agent headers when switching state~ (Fixed)
  • #2937 Agent running on worker node not updating state when node is promoted to manager
  • ~#2938 Agent takes a while to acknowledge that another agent is unavailable~
  • ~#2949 Portainer and Agent have errors when Docker command takes longer than 10 seconds~

Current status:
Portainer v1.22.0 brought the fix for #2624 and #2949 (available with agent release 1.4.0), as well as the long-awaited open-sourcing of the Portainer Agent. We hope that open-sourcing will allow us to be increasingly transparent and will open the codebase to contributions from the community.

We are now focusing all efforts towards eliminating endpoint instability within Portainer, while being as transparent as possible.

We have also created the channel #fix2535 on our community Slack server as an easier alternative to discussion on this issue. You can join our slack server here.

As we work on fixes for the multiple bugs causing this issue, we will post images containing the fixes for those willing to test. Any feedback on these fixes, or on your current deployments where you experience this issue, will be of immense help.

There may be bugs that we are unaware of and we want to make sure we cover them all.

---- ORIGINAL BUG REPORT ----

Bug description

Setting up a Portainer agent stack as described in the official documentation leads to an unstable endpoint: sometimes up, sometimes down, sometimes showing correct info, sometimes giving error messages inside Portainer.

Expected behavior
The agents should communicate flawlessly, with no errors or load problems, once they have joined the agent cluster.

Steps to reproduce the issue:

  1. Go to a shell and deploy a stack similar to this one (stack name: core):
version: '3.3'

services:
  csi-portainer-agent:
    image: portainer/agent
    environment:
      AGENT_CLUSTER_ADDR: tasks.core_csi-portainer-agent
      AGENT_PORT: 9001  # I also tried with no AGENT_PORT set, with no success
      # LOG_LEVEL: debug
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
#    ports:  # after reading another GitHub issue thread I tried setting the port, with no luck
#      - target: 9001
#        published: 9001
#        protocol: tcp
#        mode: host
    networks:
      redeBackbone:
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
...
  2. Log in to Portainer and try to add a new endpoint:

name: docker-agent
endpoint URL: tasks.core_csi-portainer-agent:9001 (DNS resolution is OK, giving one entry per agent. I also tried the IPVS address core_csi-portainer-agent:9001; it appears to be more resilient and works better than the official way, but is still very unstable when showing pages.)
public IP: core_csi-portainer-agent (DNS resolution is OK, giving the IPVS address)

  3. Go to the main page

screenshot_2018-11-29 portainer

The endpoint appears to be running and ready. When I click on it, the dashboard page appears empty with no data, but the endpoint is selected. If I then try to browse between containers, services, images, volumes, etc., it sometimes works, but after a few commands the agents stop working and the agent logs fill with entries like these: ([ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout), ([INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received) or (http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/images/json?all=0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)).

screenshot_2018-11-29 portainer2
screenshot_2018-11-29 portainer3
screenshot_2018-11-29 portainer8

If I update the service, it runs for some time and then stops working once more. If I leave it intact and retry browsing the agent endpoint after a while, it works for some time before dying again.

  4. See error

screenshot_2018-11-29 portainer4

Technical details:
Swarm cluster with one manager and 2 workers

  • Portainer version: 1.19.2
  • Docker version (managed by Portainer): 18.06-ce
  • Platform (windows/linux): linux ubuntu 16.04
  • Command used to start Portainer : docker stack deploy -c core.yml core
  • Browser: any

Additional context
I've tried some troubleshooting: I put an Ubuntu container with some tools in the same overlay network as the stack (Portainer + agents). I can see that the internal DNS resolution is OK and I can telnet to the ports listed in the logs (9001 and 7946); see below.
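
For reference, a debug container can be attached to the stack's overlay network roughly like this (a sketch; it assumes the network ends up named core_redeBackbone after the stack prefix and was declared attachable in the stack file):

# find the overlay network created by the stack
docker network ls --filter driver=overlay

# attach a throwaway Ubuntu container with network tools to it
docker run -it --rm --network core_redeBackbone ubuntu:16.04 bash

# inside the container
apt-get update && apt-get install -y dnsutils telnet curl
dig tasks.core_csi-portainer-agent
telnet 172.20.1.6 7946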

- Agents logs

##### agent on manager node ip: 172.20.1.7

2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 92e9ebd0d3f4 172.20.1.7
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 8db633a4b3a6 172.20.1.6
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 6ac3cc1f045c 172.20.1.8
2018/11/30 14:46:58 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)


##### agent on worker1 node ip: 172.20.1.8

2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 6ac3cc1f045c 172.20.1.8
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 8db633a4b3a6 172.20.1.6
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 92e9ebd0d3f4 172.20.1.7
2018/11/30 14:47:03 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:47:08 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/containers/json?all=1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/volumes: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/images/json?all=0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:54 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:49:04 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:49:44 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:50:24 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:50:52 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:51:32 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:51:34 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:52:14 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:53:12 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:54:30 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:54:32 [WARN] memberlist: Refuting a suspect message (from: 8db633a4b3a6)
2018/11/30 14:54:48 [WARN] memberlist: Refuting a suspect message (from: 8db633a4b3a6)



##### agent on worker2 node ip: 172.20.1.6

2018/11/30 14:46:59 [INFO] serf: EventMemberJoin: 8db633a4b3a6 172.20.1.6
2018/11/30 14:46:59 [INFO] serf: EventMemberJoin: 92e9ebd0d3f4 172.20.1.7
2018/11/30 14:46:59 [INFO] serf: EventMemberJoin: 6ac3cc1f045c 172.20.1.8
2018/11/30 14:47:03 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:47:09 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)
2018/11/30 14:50:35 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:50:53 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:51:32 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:51:45 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:52:25 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:53:12 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:54:05 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:54:30 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:54:33 [INFO] memberlist: Suspect 6ac3cc1f045c has failed, no acks received
2018/11/30 14:54:45 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:54:49 [INFO] memberlist: Suspect 6ac3cc1f045c has failed, no acks received
2018/11/30 14:55:25 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:55:29 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:55:31 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:55:58 [INFO] memberlist: Suspect 6ac3cc1f045c has failed, no acks received

- DNS resolution inside the overlay network

dig tasks.core_csi-portainer-agent

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> tasks.core_csi-portainer-agent
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35022
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tasks.core_csi-portainer-agent.    IN  A

;; ANSWER SECTION:
tasks.core_csi-portainer-agent. 600 IN  A   172.20.1.204
tasks.core_csi-portainer-agent. 600 IN  A   172.20.1.206
tasks.core_csi-portainer-agent. 600 IN  A   172.20.1.203

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Mon Dec 10 13:28:00 UTC 2018
;; MSG SIZE  rcvd: 186

root@3e49a820834c:/# dig core_csi-portainer-agent

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> core_csi-portainer-agent
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34261
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;core_csi-portainer-agent.  IN  A

;; ANSWER SECTION:
core_csi-portainer-agent. 600   IN  A   172.20.1.62

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Mon Dec 10 13:28:07 UTC 2018
;; MSG SIZE  rcvd: 82
=================================

- Network connectivity to services with a new stack deployed inside the same overlay network

root@f199dab2d510:/# telnet 172.20.1.214 7946
Trying 172.20.1.214...
Connected to 172.20.1.214.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# telnet 172.20.1.214 7946
Trying 172.20.1.214...
Connected to 172.20.1.214.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# nslookup tasks.core_csi-portainer-agent
Server:     127.0.0.11
Address:    127.0.0.11#53

Non-authoritative answer:
Name:   tasks.core_csi-portainer-agent
Address: 172.20.1.216
Name:   tasks.core_csi-portainer-agent
Address: 172.20.1.214
Name:   tasks.core_csi-portainer-agent
Address: 172.20.1.215

root@f199dab2d510:/# telnet 172.20.1.215 7946
Trying 172.20.1.215...
Connected to 172.20.1.215.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# telnet 172.20.1.216 7946
Trying 172.20.1.216...
Connected to 172.20.1.216.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# telnet 172.20.1.214 7946
Trying 172.20.1.214...
Connected to 172.20.1.214.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# nslookup core_csi-portainer-agent
Server:     127.0.0.11
Address:    127.0.0.11#53

Non-authoritative answer:
Name:   core_csi-portainer-agent
Address: 172.20.1.213

root@f199dab2d510:/# telnet 172.20.1.213 7946
Trying 172.20.1.213...
Connected to 172.20.1.213.
Escape character is '^]'.

Connection closed by foreign host.

### server replies on port 9001
root@f199dab2d510:/# curl -k https://172.20.1.214:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# curl -k https://172.20.1.213:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# curl -k https://172.20.1.215:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# curl -k https://172.20.1.216:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# 

Most helpful comment

In light of recent conversation regarding this issue, we will be posting daily updates on the progress made by our team. This will include testing we have completed, research we have conducted (including articles, Stack Overflow comments and previous issues on the repository we think might be related), any leads we think we may have found, and the results of any live debugging we have done with users.

Below is the progress to date, to fill in the gaps from the fix2535 channel on our Slack server. Following this, as I mentioned previously, there will be daily updates and conversation on this issue until it is solved.

Progress to date:
29/08/19:

30/08/19:

  • Steven Kang implemented a reverse proxy and used tcpdump to investigate whether issues with IPVS and tcp keepalive timeout could help to reproduce the issue. He was not able to reproduce the issue or see any abnormal behaviour.
  • I checked the Digital Ocean deployments and they are up and stable

31/08/19:

  • Steven Kang reported that his agent setup is still stable with a reverse proxy in place
  • I checked the DO deployments and they are up and stable
  • A user Brian Christner reports:

I updated my cluster to the latest versions. First, feedback is the endpoint is more stable. I am still having issues (probably unrelated) when loading Dashboard menu as it errors out 3/5 times for unable to retrieve volumes. I have also disabled snapshots and the endpoint seems to respond better. Will keep it running and provide feedback.

  • Anthony Lapenna noted down Brian's issue and said that he will investigate.

01/09/19:

  • I checked the DO deployments and noticed errors in the logs; I reported these to Anthony Lapenna.
  • Anthony Lapenna observed that the DO swarm was configured to communicate with other nodes on the public IP, whereas the advertise address was using a private IP. We were not able to determine whether that could be the cause of any problem so far. We were also seeing a few errors related to the usage of the serf library and the network, but no instability was detected.

07/09/19:

09/09/19:

  • Anthony Lapenna conducted a live-debug session with a user who reported an endpoint instability issue, but the problem was not related to the agent.
  • I set up my own agent deployments on my local machine running Linux and Vagrant VMs and left them running. I have been using them and have not been able to reproduce the bug.

11/09/19:

  • I checked the DO deployments and they are up and stable

12/09/19:
  • I opened a bug that Anthony Lapenna and myself discovered which is related to the agent, here: #3083
  • Anthony Lapenna is still investigating go-proxy timeout but has not yet been able to reproduce endpoint instability.
  • Louis-Philippe discovered a bug where the front-end was not able to change an endpoint to up after a successful ping: #3088
  • Alphadev23 reported that Portainer 1.22.0 is less stable for them than the previous Portainer version. I noted this down to investigate further.
  • Anthony Lapenna let me know he is also spending time looking into a potential update to the default serf configuration shipped with the agent, to see if this helps with stability.
  • Anthony Lapenna is deploying the same deployment we have on DO to AWS:
    ◦ An agent endpoint added with tasks.agent
    ◦ An agent endpoint added with the virtual IP of the agent service
    ◦ An agent endpoint added with the node IP address directly
    ◦ An additional endpoint: an agent endpoint behind traefik as a reverse-proxy
13/09/19:

  • Anthony talked with alphaDev23 to debug his environment; alphadev reports stability with the portainerci/agent:feat-skip-ingress image and ingress mode port mapping.

14/09/19:
  • I opened the issue that Louis-Philippe reported related to the agent #3088
  • I investigated alphadev23's issue on the moby repo (https://github.com/moby/moby/issues/37458) to see if it had an effect on the agent, due to our use of the host-mode port. I found that using publish-add to add a port to the agent container resulted in the agent endpoint going down in Portainer and the agents throwing this error: [ERR] memberlist: Failed to send ping: write udp [::]:7946->10.255.0.29:7946: sendto: operation not permitted
  • After discussion with Steven I now know this is because you need to use endpoint_mode: dnsrr together with host-mode port publishing when adding ports to a service, otherwise the service becomes unresponsive (see the sketch after this list). I noted that I need to investigate this further with alphadev to see if there is another triggering factor.
  • I began investigating whether different timezones between nodes had an effect on the agent endpoint, based on the discussion by user Claude Robitaille in the fix2535 channel on slack
  • I managed to get endpoint instability with a two-node swarm in Vagrant (1 manager and 1 worker). I changed the timezone of the manager and then made the worker leave the swarm and re-join. The agent then deployed back to the worker and the endpoint was shown as down in Portainer, while it was still responding to pings and docker info reported the swarm as active. This was not 100% reproducible in my testing, though.
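
For reference, a minimal sketch of the endpoint_mode: dnsrr plus host-mode port combination mentioned above, expressed as a stack file (illustrative only; it follows the agent service definitions used earlier in this thread and is not an official recommended configuration):

version: '3.3'

services:
  agent:
    image: portainer/agent
    environment:
      AGENT_CLUSTER_ADDR: tasks.agent
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    ports:
      # publish the agent port in host mode (bypasses the ingress routing mesh)
      - target: 9001
        published: 9001
        protocol: tcp
        mode: host
    deploy:
      mode: global
      # per the comment above, dnsrr is needed alongside host-mode ports,
      # otherwise the service can become unresponsive
      endpoint_mode: dnsrr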

All 140 comments

Hi @jeysibel, is your Docker environment hosted inside a specific cloud provider infra? (AWS, GCP...)

We have a similar problem here. I set up a 2-node test cluster where I am testing Portainer 1.20.0 with agent version 1.2.0 under Docker 18.09.0.

Before the version updates we had the exact same problem as described by @jeysibel.

Our setup is 2 VMs (Ubuntu 18.04.1) as nodes (1 manager, 1 worker). Our stack is currently defined like this:

version: '3.2'
services:
  portainer:
    image: portainer/portainer:1.20.0
    command: --no-analytics -H tcp://tasks.portainer-agent-internal:9001 --tlsskipverify
    networks:
      agent_network:
      traefik:
    volumes:
     - portainer_data:/data
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

  portainer-agent-internal:
    image: portainer/agent:1.2.0
    environment:
      AGENT_CLUSTER_ADDR: tasks.portainer-agent-internal
    volumes:
     - /var/run/docker.sock:/var/run/docker.sock
     - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      agent_network:
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

networks:
  traefik:
    external:
      name: somename
  agent_network:
    driver: overlay
    attachable: true

volumes:
  portainer_data:
When I freshly deploy this stack (meaning that I also had to remove the Portainer volume) and then log in to Portainer, the agent has status UP.

But there is already a startup error:
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:37:37 http error: endpoint snapshot error (endpoint=primary, URL=tcp://tasks.portainer-agent-internal:9001) (err=Error response from daemon: )

When I click on this endpoint, everything seems to work until I click on Volumes. It takes some time until an error occurs, and after this error the agent does not work anymore.

Portainer outputs these logs:

management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:21 http: proxy error: context canceled
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:24 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:24 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

And the portainer-agent logs:

management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 66e637779153 10.0.107.7
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 2b53335926b2 10.0.107.6
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:14 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:37 http error: Unable to execute cluster operation (err=Get https://10.0.107.6:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:38:42 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:39:21 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes?filters=%7B%22dangling%22:%5B%22false%22%5D%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:12 [INFO] serf: EventMemberJoin: 2b53335926b2 10.0.107.6
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:12 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 66e637779153 10.0.107.7
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:27 http error: Missing request signature headers (err=Unauthorized) (code=403)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:38:42 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:38:42 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:39:21 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes?filters=%7B%22dangling%22:%5B%22true%22%5D%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:42:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:42:37 http error: Unable to execute cluster operation (err=Get https://10.0.107.6:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:42:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)

When it reaches this state, no restart (of the agent or Portainer) helps it recover. I have to remove the stack and the volume and redeploy everything. This works until someone clicks on the volumes again.

As additional info: the volume command is really slow for our Docker engine; it takes about 9 seconds. That is a problem we have to work on, but it should not break the Portainer agent the way it currently does.
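
For reference, the slowness can be measured directly against the Docker engine on the affected node; a sketch:

# time the CLI call that lists volumes
time docker volume ls

# the equivalent raw API call over the Docker socket, for comparison
time curl -s --unix-socket /var/run/docker.sock http://localhost/volumes > /dev/null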

Argh. And the volumes are also loaded on the dashboard, so it will break there as well.

I solved our issue, and I now think it is a different one. My problem was a volume plugin spec pointing to a socket that did not exist anymore because we had removed the daemon. After removing the spec, Portainer runs fine. But I still think this should not break the whole agent.
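
For anyone hitting something similar: legacy volume plugin specs are discovered from files on the host, so a stale spec can be located and removed roughly like this (a sketch; the directories are the standard Docker plugin discovery locations, and the actual spec file name will differ):

# legacy plugin discovery locations
ls /run/docker/plugins /etc/docker/plugins /usr/lib/docker/plugins 2>/dev/null

# inspect the spec that points at the removed daemon, then remove it
cat /etc/docker/plugins/<plugin-name>.spec
rm /etc/docker/plugins/<plugin-name>.spec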

Hi @jeysibel, is your Docker environment hosted inside a specific cloud provider infra? (AWS, GCP...)

We're currently using a private cloud + Docker (18.06) installed on Ubuntu 16.04 VMs, all of them in the same network/VLAN; network traffic is OK between hosts.

Hi! We have a similar issue: the endpoint is marked as down. Restart and update don't help.
The same messages appear in the portainer-agent logs:

2018/12/29 16:57:50 [WARN] memberlist: Refuting a suspect message (from: 28e7b98c2624)
2018/12/29 16:58:01 [INFO] memberlist: Suspect b8b37729c657 has failed, no acks received
2018/12/29 16:58:09 [WARN] memberlist: Refuting a suspect message (from: b8b37729c657)
2018/12/29 16:58:14 [INFO] memberlist: Suspect d10dcd14da8d has failed, no acks received
2018/12/29 16:58:17 [INFO] serf: attempting reconnect to 8b3adfa511ac 10.0.0.53:7946
2018/12/29 16:58:20 [ERR] memberlist: Push/Pull with d10dcd14da8d failed: dial tcp 10.0.0.71:7946: i/o timeout
2018/12/29 16:58:20 [WARN] memberlist: Was able to connect to f97057df58c8 but other probes failed, network may be misconfigured
2018/12/29 16:58:31 [INFO] memberlist: Suspect 28e7b98c2624 has failed, no acks received

We have portainer in swarm mode deployed with this stackfile:
https://downloads.portainer.io/portainer-agent-stack.yml

Same issue and exactly the same stack file as above (straight from the documentation). An endpoint refresh through the web UI usually solves it very quickly, but it's quite unstable.

Agent logs:

portainer_agent.0.juz0j6hmsuav@node1    | 2019/01/07 22:44:12 [INFO] serf: EventMemberJoin: 53222c54e5be 10.0.6.6
portainer_agent.0.juz0j6hmsuav@node1    | 2019/01/07 22:44:14 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.juz0j6hmsuav@node1    | 2019/01/07 22:44:14 [INFO] serf: EventMemberJoin: c6141eafd580 10.0.6.8
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:44:14 [INFO] serf: EventMemberJoin: c6141eafd580 10.0.6.8
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:44:14 [INFO] serf: EventMemberJoin: 53222c54e5be 10.0.6.6
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:44:14 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:47:55 http: TLS handshake error from 10.0.6.3:43078: EOF
portainer_agent.0.ivcqpesbpxzk@node2    | 2019/01/07 22:44:46 [INFO] serf: EventMemberJoin: 23ec18d0df73 10.0.6.7
portainer_agent.0.ivcqpesbpxzk@node2    | 2019/01/07 22:45:06 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)

Portainer log:

portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:44:08 Templates already registered inside the database. Skipping template import.
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:44:08 Instance already has defined endpoints. Skipping the endpoint defined via CLI.
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:44:08 Starting Portainer 1.20.0 on :9000
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:34 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:34 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:34 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:39 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:38 background schedule error (endpoint snapshot). Unable to create snapshot (endpoint=node1, URL=tcp://tasks.agent:9001) (err=Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:47 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:47 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:47 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

etc.

Similar issue here: sometimes Portainer cannot connect to the swarm, but the swarm/containers are OK.
Portainer reports that the endpoint is down; after a while, without touching the Portainer UI, the endpoint comes back online.
This happened very rarely with the previous version, but with the latest 1.2.0 it is very common, appearing almost every time after I've been connected to the Portainer UI for a few minutes.

Agent logs:

portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:27 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 15:27:40 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:50:45 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:50:50 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:25 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:31 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:35 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:47 http error: An error occured during websocket exec operation (err=websocket: close 1000 (normal): websocket: close 1005 (no status)) (code=500)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:47 http: response.WriteHeader on hijacked connection
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:47 http: response.Write on hijacked connection
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:50 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/09 07:02:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/09 08:47:19 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 15:27:10 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 17:51:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 17:51:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 15:27:35 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 15:27:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 15:27:58 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 17:48:28 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 17:50:46 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/09 07:13:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/09 07:50:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 15:28:06 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:48:28 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:50:38 http error: The agent was unable to contact any other agent (err=Unable to find the targeted agent) (code=500)
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:50:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:51:36 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/09 07:00:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:10 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/09 07:14:17 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:22 http error: An error occured during websocket exec operation (err=websocket: close 1000 (normal): websocket: close 1005 (no status)) (code=500)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:22 http: response.WriteHeader on hijacked connection
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:22 http: response.Write on hijacked connection
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:26 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:35 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:41 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:58 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:28:06 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 17:51:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:00:31 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:03:00 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:13:59 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:14:17 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:50:59 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 08:47:20 http error: Missing request signature headers (err=Unauthorized) (code=403)

Portainer log:

portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:21:44 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:21:44 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:10 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:35 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:41 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:58 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:28:06 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:48:28 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:50:38 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:50:46 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:50:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:22 websocketproxy: Error when copying from backend to client: websocket: close 1006 (abnormal closure): unexpected EOF
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:36 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 15:29:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:00:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:03:00 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:13:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:14:17 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:50:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 08:47:20 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:46 http: proxy error: Docker container identifier not found
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:50 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:50 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:08 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:10 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:13 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:13 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:16 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:16 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:50:04 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

+1

+1

+1

For those experiencing the "Endpoint is down" error, I think that most of your problems are related to https://github.com/portainer/portainer/issues/2556, which is under investigation.

portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

The preview image portainerci/portainer:fix2556-frequent-offline-mode contains a potential fix for this problem, I'd encourage you to test it.

Note that the OP's issue is not related to this one, as I suspect network issues inside the infra/between the Swarm nodes:

2018/11/30 14:47:08 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/containers/json?all=1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/volumes: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/images/json?all=0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:54 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:49:04 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
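
For anyone willing to try the preview image on an existing stack deployment, a sketch (it assumes the stack was deployed under the name portainer, so the service is portainer_portainer as in the logs above):

# point the running service at the preview image
docker service update --image portainerci/portainer:fix2556-frequent-offline-mode portainer_portainer

# roll back to the release image afterwards if needed
docker service update --image portainer/portainer:1.20.0 portainer_portainer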

+1

Hi,
I had the same problem.
With agent version 1.1.2 everything works like a charm.
With 1.2.0 I get a timeout almost every time.
Portainer version 1.20.0, Docker 18.09.1.

Due to another strange behaviour in one of our deployed apps, I had to dive deep into Docker Swarm networking. I discovered that, although the official Docker create-swarm tutorial only states that each worker must be able to connect to the managers and vice versa (it says nothing about worker-to-worker traffic), each Docker node actually needs the Docker ports opened towards the n-1 other nodes (via iptables rules).
So when a container placed on worker node A tries to connect to a container placed on worker B, the traffic is routed between the two workers directly instead of through the managers. Once I opened the iptables ports between all n Docker nodes (managers and workers), the Portainer agent status stopped being shown as down.

I still sometimes have problems when I try to read service logs or specific container logs, or try to exec a console, from the docker-agent endpoint; if I do the same via the console on the swarm manager (where Portainer is placed), everything is OK.

I suggest that the Portainer devs add a note or a troubleshooting section to the Portainer agent documentation page (https://portainer.readthedocs.io/en/stable/agent.html) specifying that Portainer agents need to talk to each other, and not only n agents to 1 Portainer directly. In other words, communication between Docker worker nodes must be possible, with the service ports opened in both directions.

PS: I think this bug only happens when you apply firewall rules to the Docker machines; in our case, I think the rules were too restrictive, due to the lack of precise information in the official Docker documentation (https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/ - "The other nodes in the swarm must be able to access the manager at the IP address.")
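
For reference, a sketch of the firewall openings this implies on every node towards every other node (assuming iptables is managed directly; 2377/7946/4789 are the standard Swarm ports from the Docker documentation and 9001 is the agent port used in this thread; restrict each rule with -s <other node IP> as appropriate):

# Swarm cluster management (only needs to reach the managers)
iptables -A INPUT -p tcp --dport 2377 -j ACCEPT
# Swarm node-to-node gossip/discovery
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT
iptables -A INPUT -p udp --dport 7946 -j ACCEPT
# overlay network (VXLAN) data traffic
iptables -A INPUT -p udp --dport 4789 -j ACCEPT
# Portainer agent port used in this thread
iptables -A INPUT -p tcp --dport 9001 -j ACCEPT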

@deviantony Is the portainer/agent open source? I was unable to find it as part of the portainer organization.

I wanted to look at the internals and help contribute to a fix.

Hi @dang3r

No, the agent is closed-source. Thanks for your willingness to help, though.

I'm experiencing similar issues, like timeouts or wrong information that recover after a refresh. It is a newly deployed swarm cluster without any firewall rules applied. I have noticed that this happens only after some period of inactivity in the Portainer UI. As long as you navigate in the UI it is OK, but if you let it idle for a while the issue appears (most of the time).

@baskinsy are you using the latest version of Portainer and agent?

@deviantony Yes, I'm using v1.20.1 and the latest agent, also deployed as per the instructions. I'm getting inconsistency in displayed values (for example, the Dashboard shows the wrong number of running and stopped containers), which is fixed by a refresh, and also timeouts, which again recover with a refresh, mainly on the stacks list and also when connecting to a container. These usually appear after the UI has been idle for some period.

@baskinsy can you give us more details about your environment (provider, OS, arch, Docker version) and how you deployed Portainer? Portainer/agent logs can be useful too.

Same problem here. We use Ceph for data storage and we have quite a lot of volumes, so docker volume ls can take a while.
It seems like the agent's timeout is too short. Can we tune it?

There is no way to update the agent timeout, but we can investigate a longer timeout implementation. Thanks for the report.

@deviantony sorry for the late reply, I'm away from base now and will try to catch some logs when I'm back. I'm using traefik, so maybe there is a cache issue involved in displaying wrong values. I'll investigate further and update.

The only thing I can spot for now is that I'm getting several of these errors in the Portainer container logs:

2019/02/19 12:11:52 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/19 12:11:52 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/19 12:11:52 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:05 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:06 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 07:22:06 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 11:26:38 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 11:26:38 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/02/20 11:26:38 http error: Invalid JWT token (err=Invalid JWT token) (code=401)

Will look further.

@baskinsy this error is related to expired sessions, not really an issue.

hi @deviantony, do you have docs about the portainer/agent image parameters? For example, for this command:

docker run -d -p 9001:9001 --name portainer_agent --restart=always -v /var/run/docker.sock:/var/run/docker.sock -v /var/lib/docker/volumes:/var/lib/docker/volumes portainer/agent

Are there any parameters we can pass to portainer_agent?

@brantxiong latest documentation for the agent is available at https://portainer.readthedocs.io/en/stable/agent.html

Hello again,

I think I have managed to find a situation where Portainer becomes unstable, and I also managed to replicate the issue after fixing it. I have a 6-node swarm cluster with 3 managers and 3 workers. The managers are the hosts named gd1, gd2 and gd3, with gd1 being the original swarm Leader. Everything works as expected and Portainer is working flawlessly.

After draining the gd1 node and rebooting it, gd2 automatically becomes the new Leader of the swarm cluster. From that point Portainer started to become unstable, and the only thing I can see is that the gd2 node (the new Leader) has 0.0.0.0 listed as its IP in the Swarm menu in Portainer. Multiple refreshes solve the issue of "no endpoint connected" or wrong displayed values. I have also recreated the cluster, redeployed everything and again drained and rebooted gd1 so that gd2 became the new Leader, and the problem re-appeared immediately. The Portainer container was running on the gd2 node from the beginning (I have persistent shared storage for the Portainer data, so it doesn't matter on which node it runs, but it was on gd2 and was not redeployed during the change of Leader).

I can see these on logs

2019/03/05 08:19:22 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:22 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:22 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:23 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:23 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:23 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:26 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:27 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:27 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:27 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:27 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:27 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:19:27 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/03/05 08:20:41 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

and in the Swarm menu all nodes are listed with their correct IP except gd2 (the swarm Leader after rebooting the original Leader gd1), which is listed with the IP 0.0.0.0. On the initial setup gd2 was listed with the correct IP, but after becoming Leader it displayed 0.0.0.0.

Hi @baskinsy

Thanks a lot for the detailed report and the reproduction steps, we'll investigate this and try to reproduce it.

Is there any estimated date on a fix for this issue? Same issue here. Using 1.20.1, docker 18.06.1-ce

From portainer logs:

2019/03/12 02:18:50 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
...

From portainer/agent logs:

2019/03/12 02:58:01 [INFO] serf: EventMemberJoin: 05cf54df6dd6 10.0.0.8
2019/03/12 02:58:01 [INFO] serf: EventMemberJoin: c3a48f1fdd27 10.0.0.10
2019/03/12 02:58:01 [INFO] serf: EventMemberJoin: 40d571ecff83 10.0.0.9
2019/03/12 02:58:01 [INFO] - Starting Portainer agent version 1.2.1 on 0.0.0.0:9001 (cluster mode: true)
2019/03/12 03:02:38 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
...

Per @rodneygp's comment 24 days ago, has increasing the client timeout been investigated?

EDIT: The issue definitely appears to be a timeout issue. In my case, there was a domain name specified in the no_proxy configuration that was not resolving with the nameservers specified in resolv.conf. Once I fixed this, 'docker info' returned faster. When there is an issue with the DNS server, 'docker info' takes longer to return (presumably because it is having difficulty resolving names), which results in Portainer showing the node as offline.

Without being able to change the timeout, the Portainer agent is dependent on the Docker daemon responding "fast enough." This makes Portainer dependent on the proper functioning of external services, sufficient resources on the Docker node, and anything else that may affect the Docker daemon's response time.

Please add the ability to set the timeout via an environment variable. Otherwise, the portainer agent does not have the robustness needed to be viable in a production environment.

EDIT2: After further use, the issue is still appearing even when there are no dns issues.

@deviantony Is there an estimated date to fix this issue. Portainer becomes unusable when the nodes intermittently become unavailable.

Hi all,

I'm experiencing the same issue, so +1 for the timeout. It appears that the agent is not waiting long enough to collect data about the running dockerd. In my case, executing a docker volume ls command locally takes a long time:

# date && docker volume ls && date
Fri Mar 22 10:14:41 CET 2019
DRIVER              VOLUME NAME
local               portainer_data
Fri Mar 22 10:14:50 CET 2019

On the portainer side, I have a "down" endpoint problem.

I have already installed Portainer in 2 other environments without any issue, and the only difference between those environments is the response time of docker CLI commands (especially `docker volume ls`).

My logs, on the agent:

2019/03/22 09:29:27 [INFO] serf: EventMemberJoin: 8e045b86ce9d 10.0.1.6
2019/03/22 09:29:27 [INFO] - Starting Portainer agent version 1.2.1 on 0.0.0.0:9001 (cluster mode: true)
2019/03/22 09:29:51 http error: Unable to execute cluster operation (err=Get https://10.0.1.6:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
2019/03/22 09:29:51 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)

on Portainer:

2019/03/22 09:29:23 Templates already registered inside the database. Skipping template import.
2019/03/22 09:29:23 Instance already has defined endpoints. Skipping the endpoint defined via CLI.
2019/03/22 09:29:23 Starting Portainer 1.20.2 on :9000
2019/03/22 09:29:29 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
2019/03/22 09:29:33 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

It seems that having a customizable timeout for the agent would be a nice feature.

Have a nice day !

Hi all,

All this slowness was due to SELinux. I "solved it" by setting SELinux to permissive or disabled in /etc/selinux/config (don't forget to reboot & restart dockerd after this modification).
That said, a customizable timeout in the Portainer agent might be a good additional feature.

@deviantony I believe that Kuaaly's comments further demonstrate that this issue could have different causes. Is there an estimated date to fix this issue? Portainer becomes unusable when the nodes intermittently become unavailable.

Hi there,

As a response to the potential timeout issue, have a look at the instructions below.

Could any of you give the portainerci/agent:custom-timeout image a try? It allows you to specify a custom timeout for Docker requests via the DOCKER_CLIENT_TIMEOUT environment variable.

Its default value is 10 seconds (as it is in the latest release of the agent) and it can be overridden via this env var, e.g. -e DOCKER_CLIENT_TIMEOUT=30.
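
For reference, a minimal agent service definition using this test image could look like the following. This is only a sketch based on the stack files shared in this thread; adjust the volumes, networks and constraints to your own deployment:

version: '3.2'

services:
  agent:
    image: portainerci/agent:custom-timeout
    environment:
      AGENT_CLUSTER_ADDR: tasks.agent
      # raise the Docker client timeout from the default 10 seconds to 30 seconds
      DOCKER_CLIENT_TIMEOUT: 30
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

networks:
  agent_network:
    driver: overlay
    attachable: true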

@deviantony I will pull portainerci/agent:custom-timeout this week and test over the upcoming week(s). I appreciate you modifying the agent which will hopefully resolve this issue.

I'd like to confirm that my issue (very unstable connection to the endpoints) mentioned in this discussion on 7 January 2019 has been solved and that 1.20.2 is very stable using the same configuration. Thank you!

As a response to the potential timeout issue, have a look at the instructions below.

I haven't tried this particular agent build, but found the same issue with a local 3 node cluster running on CoreOS with a misconfigured RexRay s3fs volume driver that was causing "docker volume ls" to take about 20s. Disabling the volume driver fixed the issue with the agent being unavailable.

Don't know how practical it would be, but would it make sense to query some of these Docker values independently? i.e. per-node info, containers, volumes, images, with a separate timeout for each? That way, if there was an issue with one, at least the server-agent relationship could still operate, and you may be able to provide better user feedback about the cause of the issue.

Ideally the agent should be able to return a partial dataset instead of cancelling the original request. Although we'd need to update the UI/UX to warn the user about the fact that partial information is rendered. This is something that we'll consider.

@deviantony After adding an endpoint using portainerci/agent:custom-timeout as the agent on the cluster, the endpoint registered, then the following errors were in the portainer logs and the endpoint was down:

2019/04/06 21:13:06 http error: Unable to ping Docker environment (err=Get https://192.168.x.x:9001/_ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2019/04/06 21:15:21 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/04/06 21:15:21 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
...

After restarting portainer, the endpoint appeared up again.

At least for the portainerci/agent:custom-timeout agent, there appears to be an issue in registering the endpoint, where the endpoint registers but, absent a Portainer restart, produces the above errors.

Please advise.

EDIT: After a brief period, a different endpoint went down. After refreshing the endpoints, it is back up, but the agents are very unstable. I would recommend some serious testing with these to figure out why Portainer is listing them as down. The agent is much needed.

As a side note, Portainer loses connectivity with a container's shell (using the agent) when executing certain commands such as vim. Other commands such as cat and ls work without issue. The only workaround is to access the container directly (which is what was required when using the Docker environment endpoint).

@alphaDev23 did you specify a custom timeout value with this image? e.g. -e DOCKER_CLIENT_TIMEOUT=30 ?

2019/04/06 21:13:06 http error: Unable to ping Docker environment (err=Get https://192.168.x.x:9001/_ping: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)

This is definitely a network issue and probably not related to Portainer. The fact that you had to restart the Portainer container to fix this makes me think that it's related to the environment where Portainer is running and not Portainer itself.

At least for the portainerci/agent:custom-timeout agent, there appears to be an issue in the registering the endpoint where the endpoint registers and absent a portainer restart will produce the above errors.

Do you mean that you can reproduce this problem 100% of the time?

EDIT: After a brief period, a different endpoint went down. After refreshing the endpoints, it is back up, but the agents are very unstable. I would recommend some serious testing with these to figure out why Portainer is listing them as down. The agent is much needed.

Seems to be another issue, I'll need the logs of the agent to determine where the issue might come from.
We're aware that the agent is much needed and we're trying our best to solve these stability issues.

I'd like to thank you again for your time helping us troubleshoot these problems.

Regarding your last point, this is a known issue in the Portainer core (vim not supported; it affects both agent and non-agent endpoints).

@deviantony: I had the same issue initially mentioned by @jeysibel here in my local home network, and it took me about three weeks to figure out what was going on and how to solve it. I'm using the portainer-agent-stack.yml to deploy the stack on some nodes running on the same ESXi host, plus two others on VMware Player on notebooks for reference. The solution working for me may help any user whose nodes use the portainer_agent overlay network on interfaces which 1) can resolve external DNS queries and 2) whose resolvers do not answer NXDOMAIN for the agent's service name, e.g. "tasks.agent".

Observation:

On agent startup, the agent tries to find the other agents under AGENT_CLUSTER_ADDR via a DNS query, immediately on container startup. At this time the overlay network doesn't seem to be fully established. So if the (host) interface can resolve external DNS queries, the query for "tasks.agent" was sent to my network's DNS resolver, which didn't know about it and forwarded it to my ISP. But my ISP didn't answer with NXDOMAIN; it answered with a set of IPs of their honeypots/security systems/whatever:

(screenshot of the DNS answer omitted)

This DNS query to find other agents in the swarm seems to be done only once, on startup.

Then I configured my local DNS resolver (Pi-hole) to block these queries, but Pi-hole answers blacklisted domains with the IP 0.0.0.0 and the agent behaviour remained just as broken. So I removed the blacklist entry and (after consulting the dnsmasq man page) configured the underlying dnsmasq to answer queries for "tasks.agent" with NXDOMAIN, by adding a config file with the line
address=/tasks.agent/
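
For anyone wanting to replicate this, the line can live in a dnsmasq drop-in file; the path below is only an illustration, adjust it to your own dnsmasq/Pi-hole layout and restart dnsmasq afterwards:

# /etc/dnsmasq.d/99-portainer-agent.conf (example path)
# answer queries for tasks.agent with NXDOMAIN instead of forwarding them upstream
address=/tasks.agent/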

After this I redeployed the portainer-agent-stack.yml and everything works fine. I can add, drain and remove nodes without any further problems. At this time there are 7 VMware-based nodes running: 3 managers (to test redundancy) and 4 workers on different hosts, all communicating with each other really stably.

Conclusion:

The agent's initial (and only) cluster node discovery seems to happen before the overlay network has fully settled. In a closed environment this doesn't matter. But if the host is able to forward DNS queries and the answer is anything other than NXDOMAIN, the agent tries to connect to that bogus address and gives up on failure. Such an agent will never communicate with the other agents in the swarm.

Suggestion:

Please add a way to provide a custom cluster discovery delay, maybe as an environment variable "AGENT_DISCOVERY_DELAY" which could default to "0s"; if it is set, it would delay the agent's cluster discovery by the given number of seconds.

Steps to reproduce:

1) If you face this issue, make your network's DNS resolver answer queries for "tasks.<service_name>" with NXDOMAIN and redeploy the stack.
2) If you don't face this issue but want to reproduce it, make your network's DNS resolver answer queries for "tasks.<service_name>" with 0.0.0.0 and redeploy the stack.
3) If you get an answer to a DNS query for "tasks.<service_name>" on even one host in your network, that host will most likely show the same behaviour.

This DNS manipulation may be difficult for cloud solutions; I have not tested the impact of local hosts files so far.

Privacy concern:

Because the DNS query for "tasks.<service_name>" is forwarded to other DNS resolvers and ISPs, someone you may not want to inform learns that you are running Portainer, and therefore Docker, in your network.

Regards,
RedWolf74

Edit: Environment information:
Server: Docker Engine - Community
Engine:
Version: 18.09.5
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: e8ff056
Built: Thu Apr 11 04:10:53 2019
OS/Arch: linux/amd64
Experimental: false

Portainer Version: 1.20.2

@RedWolf74 thanks a lot for that detailed report!

I don't think that introducing a discovery delay is the best solution here, we probably need to replace the existing discovery mechanism.

I think I am also experiencing this. Occasionally, on each node, the endpoint is marked as down in Portainer and the error returned is "Endpoint is unreachable."

Logs from an agent container:

2019/06/11 17:13:34 http error: Unable to execute cluster operation (err=Get https://10.0.2.8:9001/networks: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
2019/06/11 17:18:04 http error: Unable to execute cluster operation (err=Get https://10.0.2.8:9001/networks: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
2019/06/11 17:28:04 http error: Unable to execute cluster operation (err=Get https://10.0.2.7:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
2019/06/11 17:28:04 http error: Unable to execute cluster operation (err=Get https://10.0.2.7:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
2019/06/11 17:28:04 http error: Unable to execute cluster operation (err=Get https://10.0.2.3:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
2019/06/11 18:03:04 http error: Unable to execute cluster operation (err=Get https://10.0.2.3:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)

and from the portainer container:

2019/06/11 17:08:26 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/06/11 17:08:26 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/06/11 17:08:30 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/06/11 17:08:30 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/06/11 17:08:30 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
2019/06/11 17:08:34 background schedule error (endpoint snapshot). Unable to create snapshot (endpoint=primary, URL=tcp://tasks.agent:9001) (err=Error response from daemon: )

But I am able to reach each node. Portainer version 1.20.2

Hi there,
We have confirmed #2624 as a bug with Portainer and have a fix on the way.
EDIT: #2624 has been fixed and will be part of the next release.

We have also confirmed #2938 and #2937 as bugs with the Agent, and will be working to get fixes out in the next milestone.

I'm pretty confident these bugs have been the cause of a lot of the instability everyone has been experiencing with the agent and endpoints becoming unavailable.

Cheers

Hi there,
I have confirmed another bug related to this here: #2949

Waiting for the fixed release then, thanks for the great work you're doing on Portainer!

Hi,

I am getting the same issue:

2019/06/21 02:43:52 http: proxy error: Invalid Docker response
2019/06/21 02:44:12 background schedule error (endpoint snapshot). Unable to create snapshot (endpoint=primary, URL=tcp://tasks.agent:9001) (err=Error response from daemon: )

Waiting for the fix then... Thanks!

Would be nice to have a hotfix for this one.

Is there an ETA on this fix? Numerous endpoints continue to fail even using portainerci/agent:custom-timeout (although this fix seemed to help).

Hi there,
We have confirmed a fix for #2624 that might help, which is available through the portainerci/portainer:develop image.

Otherwise development is still in progress on the other issues I mentioned earlier in this issue.

@alphaDev23 out of curiosity, are you simply using the portainerci/agent:custom-timeout image as-is, or did you also set a custom timeout value via the DOCKER_CLIENT_TIMEOUT env var?

DOCKER_CLIENT_TIMEOUT is set to 30 in the stack file.

The portainerci/portainer:develop image does not appear to resolve the issue. After removing the portainer-agent stack on the cluster and redeploying using the develop image, the endpoint is still down in the UI.

Is there an estimated date that this issue will be resolved?

Is there any update on a fix? This issue was opened 7 months ago, and endpoints in 1.20.1 are intermittently offline for no apparent reason, making it impossible at times to be productive.

@alphaDev23 This issue is at the top of our priority list as it is affecting a lot of people; however, we can't give an exact date for when all the bugs that are causing it will be fixed. This is due to the fact that there are multiple different bugs in both Portainer and the Agent contributing to unstable/offline endpoints, as I have mentioned earlier and have been linking to this issue.

While I've also been struggling with this same issue, I've so far managed to fix it by using both the portainerci/portainer:develop and portainer/agent:dev images; using one or the other didn't fix it for long, but so far having both on the development branches/tags has been running well for a few days.

@itsconquest Ran into the issue again last night making portainer unusable. I stopped for the day until portainer "figured it out," which occurred this morning. Maybe portainer just had a long day and needed to rest for the night.

With the issue being recognized for 7 months and without being able to give an estimated date it will be fixed, it doesn't feel like it is at the top of any priority list.

To be clear, it's at the top of the priority list; we have just been unable to reliably reproduce it until recently, which then enabled troubleshooting. We think we have found the root cause, but only time and testing will tell.

Rgds,

Neil Cresswell


@ncresswell What is believed to be the root cause?

Is there any chance that this is partially related to the snapshot interval such that when the database becomes large, the snapshots begin taking longer?

Just to update on this: after being on the "bleeding edge" tags for both Portainer and Agent, I've been running stable for over a week now, with no issues.

Likely unrelated, but I've definitely seen similar (possibly Docker's overlay network related?) issues with Docker Swarm, where replicated services stop communicating until restarted.
As a more concrete example, I'm still seeing this occasionally with Consul (using it with Traefik), but luckily no more with Portainer.

I've just had an occurrence again:

portainer logs:

portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:19:27 Templates already registered inside the database. Skipping template import.
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:19:28 Instance already has defined endpoints. Skipping the endpoint defined via CLI.
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:19:28 Starting Portainer 1.20.2 on :9000
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:24 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:26 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:27 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:28 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:32 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:33 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:33 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:34 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:34 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:34 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:21:34 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.feo18n262bck@ltrubtswarm01    | 2019/07/17 07:23:37 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

Agent 1 logs:

2019/07/17 07:25:29 [INFO] serf: EventMemberJoin: 824eea7c90ff 10.0.0.17
2019/07/17 07:25:31 [INFO] serf: EventMemberJoin: 727b8946d1b9 10.0.0.15
2019/07/17 07:25:31 [INFO] serf: EventMemberJoin: b531b6c6c5df 10.0.0.4
2019/07/17 07:25:31 [INFO] serf: EventMemberJoin: ee8fa4c03dd3 10.0.0.16
2019/07/17 07:25:31 [INFO] - Starting Portainer agent version 1.2.1 on 0.0.0.0:9001 (cluster mode: true)
2019/07/17 07:25:34 [INFO] memberlist: Suspect 727b8946d1b9 has failed, no acks received
2019/07/17 07:25:38 [INFO] serf: EventMemberJoin: d42f0c645e19 10.0.0.18
2019/07/17 07:25:39 [INFO] memberlist: Marking 727b8946d1b9 as failed, suspect timeout reached (0 peer confirmations)
2019/07/17 07:25:39 [INFO] serf: EventMemberFailed: 727b8946d1b9 10.0.0.15
2019/07/17 07:25:44 [INFO] serf: EventMemberJoin: 1b956f78c2fd 10.0.0.19
2019/07/17 07:25:48 [INFO] memberlist: Suspect b531b6c6c5df has failed, no acks received
2019/07/17 07:25:55 [INFO] memberlist: Marking b531b6c6c5df as failed, suspect timeout reached (2 peer confirmations)
2019/07/17 07:25:55 [INFO] serf: EventMemberFailed: b531b6c6c5df 10.0.0.4
2019/07/17 07:25:59 [INFO] memberlist: Suspect b531b6c6c5df has failed, no acks received
2019/07/17 07:25:59 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 07:27:03 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 07:27:36 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 07:29:09 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 07:29:42 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 07:30:15 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 07:30:48 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 07:31:02 [ERR] memberlist: Failed fallback ping: write tcp 10.0.0.17:45596->10.0.0.19:7946: i/o timeout
2019/07/17 07:31:02 [INFO] memberlist: Suspect 1b956f78c2fd has failed, no acks received
2019/07/17 07:31:21 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 07:32:25 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946

Agent 2 logs: (logs grabbed 2 hours later)

2019/07/17 07:25:21 [INFO] serf: EventMemberJoin: ee8fa4c03dd3 10.0.0.16
2019/07/17 07:25:23 [INFO] serf: EventMemberJoin: 727b8946d1b9 10.0.0.15
2019/07/17 07:25:23 [INFO] serf: EventMemberJoin: f15335aeb1b7 10.0.0.13
2019/07/17 07:25:23 [INFO] serf: EventMemberJoin: b531b6c6c5df 10.0.0.4
2019/07/17 07:25:23 [INFO] serf: EventMemberJoin: 628f4e8918fb 10.0.0.12
2019/07/17 07:25:23 [INFO] - Starting Portainer agent version 1.2.1 on 0.0.0.0:9001 (cluster mode: true)
2019/07/17 07:25:27 [INFO] memberlist: Suspect f15335aeb1b7 has failed, no acks received
2019/07/17 07:25:31 [INFO] serf: EventMemberJoin: 824eea7c90ff 10.0.0.17
2019/07/17 07:25:33 [INFO] memberlist: Suspect 628f4e8918fb has failed, no acks received
2019/07/17 07:25:37 [INFO] memberlist: Marking 628f4e8918fb as failed, suspect timeout reached (2 peer confirmations)
2019/07/17 07:25:37 [INFO] serf: EventMemberFailed: 628f4e8918fb 10.0.0.12
2019/07/17 07:25:38 [INFO] serf: EventMemberJoin: d42f0c645e19 10.0.0.18
2019/07/17 07:25:39 [INFO] serf: EventMemberFailed: 727b8946d1b9 10.0.0.15
2019/07/17 07:25:44 [INFO] serf: EventMemberJoin: 1b956f78c2fd 10.0.0.19
2019/07/17 07:25:46 [INFO] memberlist: Suspect 727b8946d1b9 has failed, no acks received
2019/07/17 07:25:48 [INFO] memberlist: Suspect f15335aeb1b7 has failed, no acks received
2019/07/17 07:25:51 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 07:25:53 [INFO] memberlist: Suspect f15335aeb1b7 has failed, no acks received
2019/07/17 07:25:55 [INFO] memberlist: Marking b531b6c6c5df as failed, suspect timeout reached (2 peer confirmations)
2019/07/17 07:25:55 [INFO] serf: EventMemberFailed: b531b6c6c5df 10.0.0.4
2019/07/17 07:25:57 [INFO] memberlist: Marking f15335aeb1b7 as failed, suspect timeout reached (0 peer confirmations)
2019/07/17 07:25:57 [INFO] serf: EventMemberFailed: f15335aeb1b7 10.0.0.13
2019/07/17 07:25:59 [INFO] memberlist: Suspect b531b6c6c5df has failed, no acks received
2019/07/17 07:26:31 [INFO] serf: attempting reconnect to f15335aeb1b7 10.0.0.13:7946
2019/07/17 07:27:04 [INFO] serf: attempting reconnect to 628f4e8918fb 10.0.0.12:7946
2019/07/17 07:27:38 [INFO] serf: attempting reconnect to 628f4e8918fb 10.0.0.12:7946
2019/07/17 07:28:11 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 07:28:44 [INFO] serf: attempting reconnect to 628f4e8918fb 10.0.0.12:7946
2019/07/17 07:29:17 [INFO] serf: attempting reconnect to f15335aeb1b7 10.0.0.13:7946
2019/07/17 07:29:50 [INFO] serf: attempting reconnect to f15335aeb1b7 10.0.0.13:7946
2019/07/17 07:30:23 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 07:30:56 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 07:31:02 [WARN] memberlist: Refuting a suspect message (from: 1b956f78c2fd)
2019/07/17 07:31:29 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 07:32:02 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 07:32:35 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
.... more of the same ...
2019/07/17 08:53:34 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 08:54:07 [INFO] serf: attempting reconnect to f15335aeb1b7 10.0.0.13:7946
2019/07/17 08:54:40 [INFO] serf: attempting reconnect to f15335aeb1b7 10.0.0.13:7946
2019/07/17 08:55:13 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946
2019/07/17 08:55:47 [INFO] serf: attempting reconnect to f15335aeb1b7 10.0.0.13:7946
2019/07/17 08:56:20 [INFO] serf: attempting reconnect to b531b6c6c5df 10.0.0.4:7946
2019/07/17 08:56:53 [INFO] serf: attempting reconnect to 727b8946d1b9 10.0.0.15:7946

And after a while doing something else, I suddenly noticed it worked again.
But the logs still go on with the attempting-to-reconnect messages. There is nothing in the logs that indicates the moment when it started working again, except that in the Portainer logs the "unable to query endpoint" messages stopped appearing once it was working again.

I also believe it is something to do with the swarm's overlay network not functioning correctly for a while after whatever caused the node to go down and come back up.

@ncresswell What is believed to be the root cause?

Is there any chance that this is partially related to the snapshot interval such that when the database becomes large, the snapshots begin taking longer?

I asked the above 8 days ago!

The issue has been re-occurring on and off for a week, and now the endpoint is just falsely labelled as DOWN, making productivity impossible.

Extremely frustrating, and the Portainer team keeping the details of "We think we have found the root cause" secret, with NO estimated resolution time (after over 7 months), is only exacerbating the frustration. I thought Portainer's tag line is "Making Docker Management Easy", but this issue is not harmonious with that tag line.

Well, Portainer looks great and promising, but for now I am mostly using Swarmpit and Swirl until we get a solution; I just can't wait indefinitely for the dev team to reply. However, I still use Portainer when it happens to work, because we have all 3 tools running in the swarm.

@alphaDev23 we're aware that this issue can be frustrating and that we've parked it for too long now (mainly due to other priorities and to the complexity of the problem).

Please be aware that Portainer is free and open-source software; as such, we had to focus our resources on other topics that help us continue to fund the development of Portainer...

I believe that there is not a single root cause to this problem but instead multiple potential causes with the consequence stated in this issue. This, and the multiple dimensions of a Portainer deployment (deployment method, endpoint details, underlying platform), make it hard to really pinpoint the source of the problem, even with the detailed reports of all the users that participated in the issue (greatly appreciated).

We've been able to identify multiple potential issues that could cause this problem so far; @itsconquest, would you care to share the related issues here?

We've discussed this problem internally and decided to put a high priority on this topic once the next release is done, which will be out on 26.07.2019. As part of this release, the agent will also be open-sourced, and we hope to get some help from the community on this topic.

@deviantony I realize it is open source, but keeping the potential root causes of the problem a secret doesn't allow your community to attempt tweaks or to offer suggestions, logs, etc. Additionally, this issue has been outstanding for 7 months (not 7 days or even weeks). As others have noted, they "use portainer when [it works]."

Finally, piling new features, via a new release, on top of an unstable code base just leads to an unhealthy and frustrated community. Portainer is just one tool in a series of tools used to create and manage applications. New features and API changes often result in changes to other tools and/or scripts. Having to modify and debug Ansible scripts, for example, to instantiate and debug applications in an attempt to utilize an updated but unstable platform management solution isn't the most appealing proposition.

but keeping the potential root causes of the problem a secret doesn't begin to allow your community to attempt to make tweaks to possibly offer suggestions, logs, etc.

We're sorry about this; we never wanted to keep this a secret, it was just a matter of bad communication on our side. We'll update this issue soon with all the related issues we've opened recently that we think could be causing this instability problem.

Finally, piling new features, via a new release, on top of an unstable code base just leads to an unhealthy and frustrated community.

We agree and we'll be working on that problem right after the next release.

@alphaDev23 I have updated the description of this issue with all of the bugs contributing to this problem that we know about. I will continue to update the description as we become aware of others. I have also pinned this issue to make as many users aware as possible. Note, we are open sourcing the agent in the next release in an effort to be more transparent in our bug fixing & development process.

@deviantony, @itsconquest : @ncresswell mentioned that "We think we have found the root cause..."

What is believed to be a root cause? (I asked this question 11 days ago but still have not received an answer). The changed description only mentions that there is "no single root cause," but it would be good to attempt to eliminate one of the many possible root causes.

Also, while I'm unsure whether there is a specific Docker command referenced in #2949, using the 'docker info' command I can confirm that this does not appear to be directly related to a root cause. I see the same result when 'docker info' returns immediately from the command line on the endpoint host.

@alphaDev23 I told Neil that we had potentially found multiple issues that could cause this problem, and he stated "We think we have found the root cause..." in his comment, but as I said before, and as the updated description of the issue shows, I believe there is not a single root cause.

That's a bad communication problem from our side, sorry about that.

What is believed to be a root cause?

See the list of related issues in the updated description.

Also, while I'm unsure whether there is a specific Docker command referenced in #2949, using the 'docker info' command I can confirm that this does not appear to be directly related to a root cause. I see the same result when 'docker info' returns immediately from the command line on the endpoint host.

It would be better to comment in #2949 directly for this one, but basically if any of the commands defined in https://github.com/portainer/portainer/blob/develop/api/docker/snapshot.go#L13 fails, the snapshot will fail and cause the endpoint status to be set to down.

@deviantony While some of the issues in the list do not appear to be immediately helpful for temporarily resolving the instability I'm experiencing, your reference to the snapshot commands was. I had previously thought snapshots might be related and had increased the interval to 120 minutes the other day, but had not seriously tested this change. Your reference suggests that it is a setting worth testing. I now have it set to 3600 minutes and will watch the endpoint's stability.

Has the Portainer team tested changing the snapshot setting as a temporary fix? Also, is there a way to turn off snapshots?

Are there other specific areas of code that you suspect may be related to the issue? Are there specific features that you suspect may be directly related to the issue and/or that could be disabled in order to temporarily improve agent stability?

UPDATE: It would appear that the snapshot interval is not adhering to '3600m' as I have noticed the following two messages in the interface:

Last snapshot: 2019-07-22 18:28:29
Last snapshot: 2019-07-22 18:44:02

These appear to reflect a default (?) of ~15m rather than the configured setting. Please advise.

@alphaDev23

#2624 prevented you from accessing an agent endpoint after switching from another agent endpoint resource view (for example, accessing Endpoint 2 container details and then going back to Home and clicking on Endpoint 1 -> Endpoint is down in the UI).

The default snapshot interval should be 5 minutes.

If you want to take the snapshots out of the equation, you can start your Portainer instance with the --no-snapshot flag. This will disable background snapshots (not recommended in general, as Portainer relies on snapshots for some features such as offline mode or the host scheduler, but if you're not using these features then you should not be impacted).
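
For example, adapting the portainer service from the stack files shared in this thread (a sketch only, the rest of the stack stays unchanged), the flag is simply appended to the command property:

  portainer:
    image: portainer/portainer
    # --no-snapshot disables background snapshots; offline mode and the host
    # scheduler rely on snapshots, so only use this if you don't need them
    command: -H tcp://tasks.agent:9001 --tlsskipverify --no-snapshot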

UPDATE: It would appear that the snapshot interval is not adhering to '3600m'

@alphaDev23 I am going to test the snapshot not adhering to the set interval now to see if we have another bug.

The only way I can access the endpoint currently is by using the --no-snapshot flag. Even with this enabled, I constantly receive errors, with Error: a.Snapshots[0] is undefined in the browser debug output.

Not sure what changed recently but it was working just fine till after the Swarm Manager restarted. When opening the volume menu I get the following error as well now.

Hey guys, could you please deploy the environment with the agent endpoint's URL with the tasks. prefix excluded? An example is attached below:

version: '3.2'

services:
  agent:
    image: portainer/agent
    environment:
      # REQUIRED: Should be equal to the service name prefixed by "tasks." when
      # deployed inside an overlay network
      AGENT_CLUSTER_ADDR: tasks.agent
      # AGENT_PORT: 9001
      # LOG_LEVEL: debug
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

  portainer:
    image: portainer/portainer
    command: -H tcp://agent:9001 --tlsskipverify
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - portainer_data:/data
    networks:
      - agent_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

networks:
  agent_network:
    driver: overlay
    attachable: true

volumes:
  portainer_data:


Thanks a lot for your help in advance, and please ping me if you have any questions!

@ssbkang Has this been tested by the Portainer team and what do the suggested changes specifically do in order to attempt to resolve the issue?

@alphaDev23 yes, it was tested internally.
To explain it at a higher level: tasks.agent resolves to the IP addresses of the actual agent containers, which is DNS-based and dynamic.
agent itself is the service name, which resolves to the VIP of the service and is static all the time. This way, Swarm will determine which agent container is healthy when Portainer makes a connection to the endpoint.

Hope this explains.
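
A quick way to see the difference on a running stack is to attach a throwaway container to the (attachable) agent overlay network and resolve both names. This is just a sketch: it assumes the stack was deployed under the name portainer (so the network is named portainer_agent_network) and that the short service alias agent resolves on that network; otherwise use the full service name:

# one A record per agent task (DNS round robin, changes as tasks are rescheduled)
docker run --rm --network portainer_agent_network alpine nslookup tasks.agent

# the single virtual IP (VIP) of the service, which stays stable
docker run --rm --network portainer_agent_network alpine nslookup agent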

To be clear, at this point, we did not confirm that this would solve the instability issue. It's only a lead.

@ssbkang @deviantony For clarity, in the agent service example, AGENT_CLUSTER_ADDR should be set to just "agent"?

The proposed resolution is focused on swarm DNS issues. Have you noticed DNS-resolution-related entries associated with the agent being down in a log? Or is the suggested DNS modification a guess which has not been tested by the Portainer team? If it has been tested, do the logs reflect the changes?

Also, is the agent the only service that needs to be modified, i.e., just changing the AGENT_CLUSTER_ADDR variable?

@alphaDev23

The AGENT_CLUSTER_ADDR variable should not be altered. It should always be equal to tasks.<service_name>.

What we're trying to investigate here, is the Endpoint URL used when creating an Agent endpoint.

For example, when deploying Portainer and the agent as a stack and creating the default endpoint via the command property:

    command: -H tcp://agent:9001 --tlsskipverify

This could also be done in the UI via Endpoints > New Endpoint > Agent.

We're still unsure about this issue being caused by the agent itself; we're also investigating Portainer not being able to reach the agent (during a snapshot, for example, causing the endpoint to go down).

Also, with the latest release of the agent (1.4.0), turning on debug logs via -e LOG_LEVEL=debug can give us more details about the agent behavior.

@deviantony I'm unclear as to the answer to my previous questions (rephrased slightly here):

  • Have you noticed DNS-resolution-related entries associated with the agent being down in a log?
  • Has the suggested DNS modification been tested by the Portainer team?
  • Do the logs reflect the changes?
  • Have you noticed DNS-resolution-related entries associated with the agent being down in a log?

Some users have reported the following errors in their logs:

Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?

This means that the Portainer instance is not capable of connecting to the remote endpoint. It could be a DNS resolution issue, with tasks.agent answering with the IP of a dead agent container (stopped, crashed, dead node...), hence the proposal to try the virtual service IP instead, via <service_name> directly.

  • Has the suggested dns modification been tested by the Portainer team?

Quickly, but we're still investigating it and we're looking for user feedback. This change is NOT RECOMMENDED for production systems.

  • Do the logs reflect the changes?

We don't know yet; we're still investigating this one and, as said earlier, it is still a lead. We're also looking for as much feedback as we can get.

FYI we're putting all our efforts on this issue right now.

First, my recommendation would be to update to the latest version of Portainer if you can (1.22.0) as it brings the following changes/fixes:

  • #2624: this one fixes an issue that would cause your endpoint to go down after switching between two different agent endpoints
  • #2649: this one increases the timeout associated with Docker requests, preventing snapshots from failing due to long response times on the Docker environment side (volume plugins, heavily loaded environments...) and as such not marking the endpoint as down

Secondly, update the agent to the latest version and enable debug logs via -e LOG_LEVEL=debug; this will allow us to gather more information about the agent behavior.

Here is an example of a stack you can use to deploy this setup (note that we're not using -H tcp://agent:9001 here, as we're still not sure about the impact):

version: '3.2'

services:
  agent:
    image: portainer/agent:1.4.0
    environment:
      # REQUIRED: Should be equal to the service name prefixed by "tasks." when
      # deployed inside an overlay network
      AGENT_CLUSTER_ADDR: tasks.agent
      # AGENT_PORT: 9001
      LOG_LEVEL: debug
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

  portainer:
    image: portainer/portainer:1.22.0
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - portainer_data:/data
    networks:
      - agent_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

networks:
  agent_network:
    driver: overlay
    attachable: true

volumes:
  portainer_data:

We're aware that this issue is impacting a lot of people and we're sorry for the lack of updates on this topic. We're now trying our best to fix this, but we'll probably require some help due to how many different setups seem to be impacted.

The information you provided above is very helpful. On the other hand, it would be good for the Portainer team to sufficiently test any recommended solution before looking to the community for further testing. Otherwise many in the community will be spending their valuable time participating in essentially a guessing game. If, however, the Portainer team does more than quick testing, the community would be assured that the time they spend on testing is well spent.

Agreed, we're just trying to leverage the community if they have the ability to test this with us in staging environments. Will update this issue if the virtual service IP lead goes anywhere.

Update 29/07/19:
With our attention focused on this issue, we are seeking all of the feedback from the community that we can. If anyone is experiencing this issue and is available for a live debug session, we would love to get in touch! Mention deviantony or myself and we can arrange a virtual meeting!

@itsconquest For the record, I can test this in a dev/production environment, but only after the Portainer team has done more than quick testing on the proposed resolution. Re-setting up endpoints/Portainer instances which others depend on, without any confidence that we are not just guessing at a solution, is not something that I, and I'm sure many others, can commit to due to time limitations. If there are people who can, then they have way more free time than I do.

Please let me know when your team has done more investigation internally: that they have researched the issue more than has currently been done (per the conversation yesterday), what has been done, and that there is some degree of confidence, based upon that analysis, that the proposed solution may work because the evidence for the change is shown by X (whatever X may be). I'm confident that you will get more traction with the community if the Portainer team puts in the effort above, which I assume is planned anyway given that you are focused on this issue.

Hey everyone,
We have created the channel #fix2535 on our community Slack server as an easier alternative to discussion on this issue. You can join our slack server here.

As we work on fixes for the multiple bugs causing this issue, we will post images containing the fixes for those willing to test. Any feedback we can get from these fixes, or from your current deployments where you experience this issue, will be of immense help.

There may be bugs that we are unaware of and we want to make sure we cover them all.

Is there any update on this issue? Running 1.22.0 with the latest agent (pulled today), the endpoints remain offline, even though docker info shows the swarm is active with 3 nodes (the total in the cluster). As a note, 10.0.1.7, referenced below, is not one of the nodes.

2019/08/12 02:04:27 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
2019/08/12 02:04:48 [INFO] serf: attempting reconnect to 56b71271943b 10.0.1.7:7946

After removing the node from the UI and the agent and recreating both, logs from the agent return:

2019/08/12 02:13:39 http error: Missing request signature headers (err=Unauthorized) (code=403)
2019/08/12 02:14:49 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)

On Jul 29, the top post by the portainer team was: "We are now focusing all efforts towards eliminating endpoint instability within Portainer, while being as transparent as possible."

I'm not seeing that. It has been 2 weeks now and if "ALL" of your efforts were focused on this issue, there would be some update/resolution.

In light of recent conversation regarding this issue, we will be posting daily updates on progress made by our team. This will include testing we have completed, research we have conducted (including articles, Stack Overflow comments and previous issues on the repository we think might be related), any leads we think we may have found, and the results of any live debugging we have done with users.

Below is the progress to date, to fill in the gaps from the fix2535 channel on our Slack server. Following this, as I mentioned previously, there will be daily updates & conversation on this issue until it is solved.

Progress to date:
29/08/19:

30/08/19:

  • Steven Kang implemented a reverse proxy and used tcpdump to investigate whether issues with IPVS and tcp keepalive timeout could help to reproduce the issue. He was not able to reproduce the issue or see any abnormal behaviour.
  • I checked the Digital Ocean deployments and they are up and stable

31/08/19:

  • Steven Kang reported that his agent setup is still stable with a reverse proxy in place
  • I checked the DO deployments and they are up and stable
  • A user Brian Christner reports:

I updated my cluster to the latest versions. First feedback is that the endpoint is more stable. I am still having issues (probably unrelated) when loading the Dashboard menu, as it errors out 3/5 times with "unable to retrieve volumes". I have also disabled snapshots and the endpoint seems to respond better. Will keep it running and provide feedback.

  • Anthony Lapenna notes down Brian's issue and says that he will investigate.

01/09/19:

  • I checked the DO deployments and noticed errors in the logs; I reported these to Anthony Lapenna
  • Anthony Lapenna observed the DO swarm was configured to communicate with other nodes on the public IP, whereas the advertise address was using a private IP. We were not able to determine if that could be a cause of any problem so far. We were also seeing a few errors related to the usage of the serf library and the network, but no instability was detected.

07/09/12:

09/09/19:

  • Anthony Lapenna conducted a live-debug session with a user who reported endpoint instability issue but the problem was not related to the agent.
  • I set up my own agent deployments on my local machine running linux and vagrant VM’s and left them running. I have been using them and have not been able to reproduce the bug.
11/09/19: I checked the DO deployments and they are up and stable

12/09/19:

  • I opened a bug that Anthony Lapenna and myself discovered which is related to the agent, here: #3083
  • Anthony Lapenna is still investigating the go-proxy timeout but has not yet been able to reproduce endpoint instability.
  • Louis-Philippe discovered a bug where the front-end was not able to change an endpoint to up after a successful ping: #3088
  • Alphadev23 reported that Portainer 1.22.0 is less stable for them than the previous Portainer version. I noted this down to investigate further.
  • Anthony Lapenna let me know he is also spending time looking into a potential update to the default serf configuration shipped with the agent, to see if this helps with stability.
  • Anthony Lapenna is deploying the same deployment we have on DO to AWS:
    ◦ An agent endpoint added with tasks.agent
    ◦ An agent endpoint added with the agent service's virtual IP
    ◦ An agent endpoint added with a node IP address directly
    ◦ With an added endpoint: an agent endpoint with traefik as a reverse-proxy

13/09/19:

  • Anthony talks with alphaDev23 to debug his environment; alphaDev23 reports stability with the portainerci/agent:feat-skip-ingress image and ingress-mode port mapping

14/09/19:

  • I opened the issue that Louis-Philippe reported related to the agent: #3088
  • I investigated alphadev23's issue on the moby repo https://github.com/moby/moby/issues/37458 to see if it had an effect on the agent, due to our use of the host-mode port.
    I found that using publish-add to add a port to the agent container resulted in the agent endpoint going down in Portainer and the agents throwing this error: [ERR] memberlist: Failed to send ping: write udp [::]:7946->10.255.0.29:7946: sendto: operation not permitted
  • After discussion with Steven I now know this is because you need to use endpoint_mode: dnsrr together with host mode when adding ports to a service, otherwise the service becomes unresponsive. I noted that I need to investigate this further with alphadev to see if there is another triggering factor.
  • I began investigating whether different timezones between nodes have an effect on the agent endpoint, based on the discussion by user Claude Robitaille in the fix2535 channel on Slack (a quick check is sketched after this list)
  • I managed to get endpoint instability with a two-node swarm in Vagrant: 1 manager and 1 worker. I changed the timezone of the manager and then made the worker leave the swarm and re-join. The agent was then deployed back to the worker and the endpoint was shown as down in Portainer, while still responding to pings and docker info reporting the swarm as active. This was not 100% reproducible in my testing though.
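A quick, illustrative way to compare timezone and clock settings across nodes (run on each node; systemd-based hosts are assumed):

```
# Timezone and NTP status of this node (compare the output across all nodes)
timedatectl | grep -E 'Time zone|synchronized|NTP'
# Current time in UTC, to spot clock drift between nodes
date -u
```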

Update 19-8-19

15-08-19:

  • Neil mentions he has an agent container stuck in a created state. I have also seen an agent in a created state with the error message "failed to get network during CreateEndpoint. Network <xxxx...> not found", so I started to investigate whether this is a potential lead. If an agent could get into this state it would not be able to proxy Portainer's requests and thus the endpoint may show as down.
  • I discuss this with Anthony and he thinks it could be related to a swarm issue, and that this should be defined as an issue with agent resilience and not instability.

16-08-19: Update from Anthony RE the issue with the agent and ports - I don't believe the endpoint mode dnsrr is actually required. If you're publishing a new port for the agent without host mode, the agent will be added into the ingress network. This can cause an issue (see my point in the fix channel about this and the new image I published). Using host mode or endpoint mode dnsrr will prevent the service from using the Swarm routing mesh (and thus the service will not use the extra ingress network) and the agent will be fine.
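For illustration, the two "off the routing mesh" options mentioned above map roughly to the following CLI forms; this is only a sketch, and the service name, network name and mounts are assumptions rather than the official deployment:

```
# Option A: host-mode port publishing (agent reachable on <node IP>:9001,
# the service does not join the ingress network)
docker service create --name agent --mode global \
  --network agent_network \
  --env AGENT_CLUSTER_ADDR=tasks.agent \
  --publish mode=host,target=9001,published=9001 \
  --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  portainer/agent

# Option B: DNS round-robin endpoint mode with no published port (agent only
# reachable inside the overlay network as tasks.agent:9001)
docker service create --name agent --mode global \
  --endpoint-mode dnsrr \
  --network agent_network \
  --env AGENT_CLUSTER_ADDR=tasks.agent \
  --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  portainer/agent
```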

17-08-19: I discuss with Anthony the idea of introducing automated manipulation of the frontend to increase the load on Portainer & the agent endpoint, to see if this can trigger endpoint instability, since at this stage the agent testing environments have maintained stability through our manual testing.

18-08-19: I discussed Claude's messages in the fix2535 channel with Anthony and asked if he has investigated this further. He believes this is a separate issue related to agent resilience in swarm and not a case of 'instability'.

19-08-19:

  • Update from Anthony RE the instability issue - The new image allowing the use of the ingress network (removing the host-mode requirement) has helped one of our users (alphaDev). It should be tested by other users having the same issue if they can: portainerci/agent:feat-skip-ingress.
    It can be deployed via the following stackfile:
```
version: '3.2'

services:
  agent:
    image: portainerci/agent:feat-skip-ingress
    environment:
      AGENT_CLUSTER_ADDR: tasks.agent
      LOG_LEVEL: debug
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    ports:
      - "9001:9001"
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
```

Note that this fix is only supposed to fix instability issues with remote agent endpoints (i.e. accessed via a node IP + port 9001); I don't believe it would bring any improvement with an endpoint defined as tasks.portainer_agent:9001 when the Portainer instance is in the same overlay network as the agent (no port exposed, no host mode).

  • I set up the automated testing env using cypress.io to manipulate the front end and wrote some basic tests

Update 26-08-19:
_20/08/19 - 22/08/19:_ Test environments continue to be monitored by Portainer team, remaining stable with no evidence of intermittent behaviour
_23/08/19:_

  • Two users report the intermittent endpoints issue on GitHub; Wjdavis5 notes that it is affecting them more on version 1.22, and Power2All reports that it is occurring specifically on a cleanly installed Proxmox deployment.
  • I note down that these are leads that need investigating and contact Anthony Lapenna to discuss further
  • User Ryada contacts Portainer via the fix2535 channel on Slack notifying us that the Proxmox deployment is theirs and they would like to help us investigate.

_24/08/19:_ I contact Ryada and Power2All to try and organize a live debug session
_25/08/19:_ I reach out to another user of Portainer that is experiencing the agent instability issue to see if they are available to have us investigate their deployment
_26/08/19:_ Anthony and Ryada do a live debug session.

The user deploys the Portainer stack (official, no update), accesses the instance, sees the primary endpoint up and can access it.
Then clicking refresh to trigger a snapshot (or waiting for a snapshot to be triggered) causes the endpoint to go down.
The snapshot logs say "unable to reach Docker daemon at...", so Portainer was unable to reach the agent via tasks.agent even though it was working in the UI

  • Anthony confirms there is an instability issue in this environment that needs further investigation
  • Anthony suggests troubleshooting steps for user Ryada to test in their environment

I'm just chiming in here to say that we are also experiencing this issue. For us, it almost always occurs after clicking the "view logs" or "open console" features for specific containers. If you need another swarm to troubleshoot with, let me know (though it is in production and we may have to schedule in off hours).

@soundstripe are you using the latest version of Portainer? If not, you could be affected by https://github.com/portainer/portainer/issues/2624 which was fixed in 1.22.0.

I thought I was...Just double checked and I was on 1.20.0. I’m updating and will let you know if the instability comes back. Thanks!



Update 18/09/19
_26/08/19:_ Anthony made the following suggestion to Ryada to change his deployment config:

  • Make sure that ports 7946/tcp, 7946/udp and 4789/udp are open on all the nodes. For the manager node, also make sure that 2377/tcp is open.
  • Use the --advertise-addr option when creating the cluster via docker swarm init..., use either the private IP address or NIC name directly (--advertise-addr eth1:2377 for example)
  • Use the --advertise-addr option when joining a cluster on worker nodes via docker swarm join; same as above, use either the private IP or NIC name directly
  • Then deploy the stack

_27/08/19 - 28/08/19:_ Portainer team continues to monitor deployments on DO, AWS and observes no further instability

_29/08/19:_ Ryada confirms the change in his deployment config has mitigated any instability issues in his environment, suggests he will continue to monitor and update if any instability is encountered.

_30/08/19:_ Ryada has found no further instability.

_31/08/19 - 03/09/19:_ Portainer team continues to monitor deployments on DO, AWS and observes no further instability

_04/09/19:_ Anthony holds a live debug session with a user reporting agent instability.

_05/09/19:_ Portainer team continues to monitor deployments on DO, AWS and observes no further instability

_06/09/19:_

  • Anthony confirms that the user's issue from the live debug session on 04/09/19 was not related to tasks.agent usage or Portainer; instead their Docker daemon was having issues.
  • Ryada confirms no further instability.
  • Following discussion in the fix2535 channel, Anthony suggests we reassess our recommended deployment of the agent (i.e. host mode vs ingress) following the discussion with Ryada and alphadev23, and the reported improvement this mode brought for alphadev in their environment

_07/09/19-8/09/19:_ Portainer team continues to monitor deployments on DO, AWS and observes no further instability

_09/09/19:_

  • I discuss the current status of this issue with Anthony and whether we think we are ready to move forward with an agent release. He is confident that we will be ready once we come to a decision on what the recommended deployment method is for the agent (host or ingress) and the bugs that are directly causing instability are fixed.
  • I begin to draft this update to cover the events from the past few days that are not mentioned in this channel

_10/09/19:_ Portainer team continues to monitor deployments on DO, AWS and observes no further instability.

_11/09/19:_

  • I ask for a review of this update from Anthony.
  • We discuss the bugs we have found so far, their severity (causing instability vs 'appearing to cause instability') and which ones are going to be included in the next release of the agent.
  • I organise a meeting on 13/09/19 to discuss the agent instability with Anthony and Neil so that we can make a decision on the recommended deployment and which bugs are going to be included in the next release.

_12/09/19:_ Portainer team continues to monitor deployments on DO, AWS and observes no further instability.

_13/09/19:_

  • Meeting gets postponed to 16/09/19.
  • Portainer team continues to monitor deployments on DO, AWS and observes no further instability.

_14/09/19-15/09/19:_ Portainer team continues to monitor deployments on DO, AWS and observes no further instability.

_16/09/19:_

  • We discuss the agent instability issue and come to a decision that ingress mode should be our recommended deployment if it provides a benefit to users over host-mode deployment. Anthony pointed out there is an existing issue with using host mode open on the moby repo (https://github.com/moby/moby/issues/37458); if we can move to ingress, users of the Portainer agent will not encounter this problem, which is one such benefit. This change in recommended deployment would be on the condition that we are sure it won't break any existing deployments of current users, meaning that we will need to complete heavy testing to ensure that it doesn't.
  • We discuss the bugs reported in the #2535 issue and clarify the severity of the bugs so we have a priority of what needs to be fixed before the next release:

    • Bugs that are causing this instability (#2937, #2938)

    • Bugs giving the impression that an environment might have the instability (#3083, #3088, #3098)

    • There is proposed change to help mitigate the snapshot failing problem (#2940) which will be developed as part of support for offline mode for swarm in future.

  • Alongside this release we have also decided we need to make deployment instructions as clear as possible to avoid misconfiguration such as what happened with Ryada's deployment, which appeared to be instability but wasn't
  • Anthony publishes fixes for the bug #3098

_17/09/19:_

  • Anthony and I discuss issue #2938 and Anthony mentions the agent acknowledgement is very fast in his own testing (~5-10 seconds).
  • I tested the PR for the bug #3098 and confirm that it fixes the issue
  • Anthony pushes PRs for the bugs #3083 and #3088
  • I test the PRs for #3083 and #3088 and confirm that they fix the problems.
  • LP completes a technical review of the code in the PRs that fix #3083 and #3088 and approves them

_18/09/19:_

  • Anthony and I discuss the bugs #2399 and #2696 which were reported as agent instability, and fixed in version 1.4.0 of the agent https://github.com/portainer/agent/issues/23. We realized this fix was not mentioned here so I am mentioning it now for transparency.
  • I investigate #2938 to see if I can recreate the issue on Portainer v1.22 and Agent v1.4.0, Anthony continues to investigate this also

Update 19/09/19:

  • After further investigation, Anthony has deemed that issue #2938 is not related to the time it takes for an agent to acknowledge another is down, but rather due to Portainer being unable to reach an endpoint right after a node is unavailable. As such I have updated the description and title to represent this
  • Anthony has also been assessing the configuration of the agent to ensure there is nothing that could be causing instability issues. He found that the default behavior of the serf library the agent uses is to send a reconnect request every 30s for 24hrs. This should be tuned to something more appropriate for the agent's use case, and as such Anthony has opened an issue here: https://github.com/portainer/agent/issues/78

We're seeing the exact same thing here. I'm also seeing this error:
[screenshot of the error]

I really don't mean to be a troll, but I love using Portainer... when it works: its interface and its functionality make it wonderful. But in my cloud it only works 50% of the time; when it fails we turn to Swarmpit and Swirl (whose interfaces are really awful, but they work!).
[screenshot]

Update 20/09/19:
Following development, discussion and testing this week we are planning a release of agent 1.5.0 next week and a release of Portainer 1.22.1 following the merge of the pending issues in the 1.22.1 milestone

_Changes on the Portainer repo:_
- Issue #2940 has been split into two different issues; #2940 now covers the backend aspect of the issue and #3178 covers the frontend.
- Following testing, the PR's fixing the following bugs have been merged into the develop branch: #2940, #3083, #3088, #3098

_Changes on the Agent repo:_
- A PR for issues portainer/agent#42 & portainer/agent#43 was merged which fixed memory and file handle leaks in the agent, which closes #1991 & #2254 on the Portainer repo
- A PR for issue portainer/agent#75 was merged which changed the agent's handling of failed requests, which could potentially be impacting endpoint instability
- A PR for issue portainer/agent#78 was merged which changes the configuration of the agent reconnect policy I mentioned Anthony was investigating in a previous update
- A PR for issue portainer/agent#79 was merged which closes the bug #2938 on the Portainer repo
- A PR for issue portainer/agent#81 was merged which introduces more logging for the agent

If you would like to test these changes by running the develop images of portainer & the agent, you can use the following stack file. Any feedback is much appreciated.

Note: the develop image of the agent is available as portainerci/agent:develop-linux-amd64 and the develop image of portainer is available as portainerci/portainer:develop-linux-amd64

```
version: '3.2'

services:
  agent:
    image: portainerci/agent:develop-linux-amd64
    environment:
      # AGENT_PORT: 9001
      LOG_LEVEL: debug
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

  portainer:
    image: portainerci/portainer:develop-linux-amd64
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - portainer_data:/data
    networks:
      - agent_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

networks:
  agent_network:
    driver: overlay
    attachable: true

volumes:
  portainer_data:
```
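Assuming the stack file above is saved as, say, portainer-develop.yml (the filename is arbitrary), it can be deployed from a manager node with:

```
docker stack deploy --compose-file portainer-develop.yml portainer
```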

I am also having stability issues with Portainer which may be related to https://github.com/portainer/agent/issues/91 and https://github.com/hashicorp/serf/issues/512. I also observed that they may be related to the fact that the Docker API on a busy node does not respond quickly to requests from the agent. To me it looks like a failure or timeout of a single agent Docker API request breaks the whole Portainer endpoint / consolidated request/response feature.

@mback2k give a go to the development images above. We've improved stability in the latest builds.

@deviantony Are Windows variants of the developer images also available, not just Linux?

I don't know if this will help anyone else but I was able to get rid of most of these issues by rebooting all of the nodes. If you haven't tried that, give it a go!

I regularly reboot all of my Docker nodes due to kernel / OS updates and I can confirm that the issue persists, so this probably does not help in all situations.

@deviantony When will the new agent be officially released, is there more testing required, and will it continue to work with 1.22.0?

@alphaDev23 we're freezing the core codebase today and aiming for a release of both the agent and the core in two days.

It should still work with 1.22.0, although we recommend upgrading to 1.22.1 as we actually tweaked a few things regarding stability (more info in the upcoming release notes).

@deviantony The agent will switch to ingress mode ports, correct?

Thank you for your work on the stability issues.

Yes, the agent will now be supported in ingress mode.

As @mback2k already said, I can confirm that the problem persists. I use the portainerci/agent:develop-linux-amd64 and portainerci/portainer:develop-linux-amd64 images.

docker-compose.yml

```
version: '3.2'

services:
  agent:
    image: portainerci/agent:develop-linux-amd64
    environment:
      # REQUIRED: Should be equal to the service name prefixed by "tasks." when
      # deployed inside an overlay network
      AGENT_CLUSTER_ADDR: tasks.agent
      # AGENT_PORT: 9001
      LOG_LEVEL: debug
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - agent_network
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

  portainer:
    image: portainerci/portainer:develop-linux-amd64
    command: -H tcp://tasks.agent:9001 --tlsskipverify
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - portainer_data:/data
    networks:
      - agent_network
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

networks:
  agent_network:
    driver: overlay
    attachable: true

volumes:
  portainer_data:
```

Errors
I randomly get

```
2019/10/08 08:44:47 http: proxy error: context canceled
2019/10/08 08:44:47 http: proxy error: context canceled
2019/10/08 08:45:02 http: proxy error: context canceled
2019/10/08 08:48:08 background schedule error (endpoint snapshot). Unable to create snapshot (endpoint=primary, URL=tcp://tasks.agent:9001) (err=Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?)
2019/10/08 08:49:39 http: proxy error: context canceled
2019/10/08 08:49:39 http: proxy error: context canceled
2019/10/08 08:53:08 background schedule error (endpoint snapshot). Unable to create snapshot (endpoint=primary, URL=tcp://tasks.agent:9001) (err=Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?)
```

or

```
2019/10/08 08:48:56 [ERROR] [http,agent,proxy] [target_node: amd-ryz9-1] [request: /host/info] [message: unable to redirect request to specified node: agent not found in cluster]
2019/10/08 08:48:56 http error: The agent was unable to contact any other agent (err=Unable to find the targeted agent) (code=500)
2019/10/08 08:59:00 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
```

Additionally I restarted all Docker services on the different nodes as suggested - still no luck.

With this, Portainer is almost unusable, as with all actions (creating and deploying stacks, pulling new images, etc.) there is a high chance that they just don't work.

If more info is needed, let me know.

The crucial change (which solved the instability problems) was this line:

_Make sure that ports 7946/tcp, 7946/udp and 4789/udp are open on all the nodes. For the manager node, also make sure that 2377/tcp is open._

Now communication works flawlessly between all the nodes and the manager.

@codedge These are the standard requirements for overlay networking and that has nothing to do with this issue.

Thanks for the info. Then in my case the latest changes to the agent brought the needed stability.

@codedge @mback2k the agent relies heavily on overlay networking to work properly. As such, having a properly setup Swarm cluster is mandatory.

This is something that we're going to add to our documentation.

The latest changes also brought the needed stability for us, however it's a bit slower and we still see these warning logs

```
2019/10/08 23:50:33 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 00:30:34 [WARN] [docker,snapshot] [message: unable to snapshot Swarm services] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 00:50:33 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 01:30:33 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 01:50:33 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 02:15:33 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 03:15:34 [WARN] [docker,snapshot] [message: unable to snapshot Swarm services] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 03:20:33 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
2019/10/09 03:40:34 [WARN] [docker,snapshot] [message: unable to snapshot Swarm services] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?]
```

@mahmoudawadeen this is an indication that something is wrong in your Swarm environment regarding overlay networking.

It seems that Portainer can't reach tasks.agent:9001. It sounds like a Swarm issue, but I would try to use an endpoint called agent:9001 to see if service resolution is working better than task DNS resolution.
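A quick way to compare the two resolution modes from inside the overlay network (purely illustrative; the network name depends on your stack name and the network must be attachable, as in the stack files above):

```
# Per-task DNS name (one A record per agent task)
docker run --rm --network portainer_agent_network alpine nslookup tasks.agent
# Service name (resolves to the service virtual IP)
docker run --rm --network portainer_agent_network alpine nslookup agent
```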

I have managed to replicate the issue successfully in at least one case. I have a 6-node swarm cluster (3 managers + 3 workers). Every time I drain a worker for maintenance and then re-enable it, Portainer has issues communicating with the re-deployed agent on the worker node that was emptied for maintenance. The dashboard says there are 5 servers although on the "swarm" tab you can see 6. Container, image, network, etc. numbers are wrong, as if one server is missing. Refreshing the dashboard sometimes resolves the wrong values temporarily. Only force updating the portainer_agent service makes things stable again. Hope it helps with debugging the issue.
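For reference, the reproduction and the workaround described above correspond roughly to the following commands (node and service names are assumptions):

```
# Drain a worker for maintenance, then bring it back
docker node update --availability drain worker-1
docker node update --availability active worker-1

# Reported workaround once the dashboard values go stale:
docker service update --force portainer_agent
```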

We are about to release a new version that includes many fixes for agent instability. We will first run it through your exact scenario to make sure it fixes this issue.

Hello all,

I had the same issue on this environment:

```
[root@localhost ~]# docker version
Client:
Version: 1.10.3
API version: 1.22
Package version: docker-common-1.10.3-59.el7.x86_64
Go version: go1.6.2
Git commit: 429be27-unsupported
Built: Fri Nov 18 17:03:44 2016
OS/Arch: linux/amd64

Server:
Version: 1.10.3
API version: 1.22
Package version: docker-common-1.10.3-59.el7.x86_64
Go version: go1.6.2
Git commit: 429be27-unsupported
Built: Fri Nov 18 17:03:44 2016
OS/Arch: linux/amd64
[root@localhost ~]# docker-compose version
docker-compose version 1.22.0, build f46880fe
docker-py version: 3.4.1
CPython version: 3.6.6
OpenSSL version: OpenSSL 1.1.0f 25 May 2017

```

Portainer 1.22.0.

I didn't fix the issue, but I lengthened the time before a crash by modifying the following kernel memory setting:

echo "vm.max_map_count=262144" >> /etc/sysctl.conf

It will be active after the next reboot. If you don't want to reboot, just run sysctl -p and then restart Docker.

Now the endpoint stays up for a longer period: it crashes after 5 minutes instead of 30 seconds.

@mahmoudawadeen this is an indication that something is wrong in your Swarm environment regarding overlay networking.

It seems that Portainer can't reach tasks.agent:9001. It sounds like a Swarm issue, but I would try to use an endpoint called agent:9001 to see if service resolution is working better than task DNS resolution.

@deviantony thanks for the tip, this made the warning go away. However, it didn't make portainer faster. But everything is working fine except loading slowly.


@deviantony Aaaaand that's what happens when you're too eager for a problem to go away.

```
2019/10/10 06:13:27 [WARN] [docker,snapshot] [message: unable to snapshot Swarm services] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://agent:9001. Is the docker daemon running?]
2019/10/10 06:38:26 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://agent:9001. Is the docker daemon running?]
2019/10/10 07:03:26 [WARN] [docker,snapshot] [message: unable to snapshot engine information] [endpoint: STAGE-Swarm] [err: Cannot connect to the Docker daemon at tcp://agent:9001. Is the docker daemon running?]
```

@mahmoudawadeen yes, there is definitely something wrong inside your Swarm environment regarding networking.

Here are my recommendations regarding Swarm setup, ensure that you have followed these steps when creating the Swarm cluster:

  • Make sure that ports 7946/tcp, 7946/udp and 4789/udp are open on all the nodes. For the manager node, also make sure that 2377/tcp is open.

  • Use the --advertise-addr option when creating the cluster via docker swarm init..., use either the private IP address or NIC name directly (--advertise-addr eth1 for example)

  • Use the --advertise-addr when joining a cluster on worker nodes via docker swarm join, same as above use either private IP or NIC name directly

Then deploy the Portainer stack
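As a minimal sketch of those recommendations on a typical two-NIC setup (the interface name eth1, the placeholder addresses, the stack file name and the use of ufw are assumptions):

```
# Open the Swarm ports (ufw shown only as an example firewall frontend)
ufw allow 7946/tcp && ufw allow 7946/udp   # node-to-node communication
ufw allow 4789/udp                         # overlay network (VXLAN) traffic
ufw allow 2377/tcp                         # cluster management (manager nodes only)

# On the first manager: advertise on the private NIC
docker swarm init --advertise-addr eth1

# On each worker: join via the private NIC as well
docker swarm join --advertise-addr eth1 --token <worker-token> <manager-private-ip>:2377

# Then deploy the Portainer stack from a manager node
docker stack deploy --compose-file portainer-agent-stack.yml portainer
```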

I have upgraded my Portainer swarm installation to 1.22.1 and tried to drain a worker node and re-activate it. The agent was redeployed and started, but again I'm facing the semi-missing-node issue in Portainer and wrong values are displayed on the dashboard. The cluster node count shows 5 while the Swarm menu shows all 6 of them, and the dashboard values look as if the containers, volumes, etc. on the semi-missing node are not counted. Force updating the agent service resolves the issues and brings things back to normal.

@deviantony with regards to https://github.com/portainer/portainer/issues/2535#issuecomment-539664941: my Swarm networks are working fine and all required ports are accessible between the nodes.

BTW, I think there is a misunderstanding regarding published ports for the agent. I am not publishing the ports due to this issue. It would not make any sense to have published ports as a requirement. Publishing ports is only needed to make them accessible from the outside (container) world. Inside an (overlay) network the containers can reach each other via any port, just like multiple machines being on the same LAN.

Hi there everyone,
In case you missed it, Portainer version 1.22.1 is out and includes several bug fixes aimed at improving the stability of endpoints, particularly Agent enabled endpoints.

Alongside this, version 1.5.1 and 1.5.0 of the Agent are out and bring a lot of stability improvements.

If you are still experiencing instability on the latest version of Portainer & the Agent, feel free to reach out to us as we will happily walk through the issue with you.

Many thanks from the Portainer team, have a great day!

@mahmoudawadeen yes, there is definitely something wrong inside your Swarm environment regarding networking.

Here are my recommendations regarding Swarm setup, ensure that you have followed these steps when creating the Swarm cluster:

  • Make sure that ports 7946/tcp, 7946/udp and 4789/udp are open on all the nodes. For the manager node, also make sure that 2377/tcp is open.
  • Use the --advertise-addr option when creating the cluster via docker swarm init..., use either the private IP address or NIC name directly (--advertise-addr eth1 for example)
  • Use the --advertise-addr when joining a cluster on worker nodes via docker swarm join, same as above use either private IP or NIC name directly

Then deploy the Portainer stack

@deviantony This is already the case for us; this is how we set up our swarm. I will try to set up the swarm again and see if we missed something during the setup.

I have upgraded my Portainer swarm installation to 1.22.1 and tried to drain a worker node and re-activate it. The agent was redeployed and started, but again I'm facing the semi-missing-node issue in Portainer and wrong values are displayed on the dashboard. The cluster node count shows 5 while the Swarm menu shows all 6 of them, and the dashboard values look as if the containers, volumes, etc. on the semi-missing node are not counted. Force updating the agent service resolves the issues and brings things back to normal.

Same behavior with portainer 1.22.1 and agent 1.5.1


Hi there,
I was unable to reproduce this on a 3 node swarm running portainer 1.22.1 and agent 1.5.1. Hosts are running docker version 18.03.0-ce

I tested this as follows:

  1. Navigate to the cluster overview & set worker to drain mode
  2. Navigate to dashboard view & see 2 nodes in cluster + correct amount of resources are shown
  3. Navigate to cluster overview & set drained node to active in UI
  4. Navigate to dashboard view & see 3 nodes in cluster + correct amount of resources are shown

Let me know if there is anything you have done differently and I can try and reproduce this again

I can still reproduce issues as soon as one Docker node is under heavy load and requests to the Docker daemon are taking long or timing out. As soon as the CPU load goes down on the node and the Docker CLI can reach the daemon again, the Agent is also able to reach it again and everything runs smoothly. As soon as a single Agent in the cluster is unable to reach its Docker daemon, the whole Portainer instance becomes either extremely slow or unresponsive.
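A simple way to check whether an overloaded Docker daemon is the bottleneck on a given node (illustrative only; run directly on the affected node):

```
# If the daemon is overloaded this hangs or hits the timeout (exit code 124)
timeout 10 docker info > /dev/null && echo "daemon responsive" || echo "daemon slow or unreachable"
```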


So the problem is related to the load on the nodes at certain moments, which causes the requests to time out and therefore the node and container information is not returned,
correct?
so there isn't a solution?

So the problem is related to the load on the nodes at certain moments, which causes the requests to time out and therefore the node and container information is not returned,
correct?

Yes, the Agent and Portainer get stuck waiting for a response from the Docker API.

so there isn't a solution?

I think there is a need to handle such situations gracefully, for example:

  • Make the Agent time out after maybe 5 seconds.
  • If an Agent timed out, show either cached or partial results.
  • If only cached or partial results are shown, display a warning to the user in Portainer.

That would be much better than being completely unable to use Portainer in such a situation. At the moment a single stuck Agent prevents Portainer from showing any results.

Hi there,
I was unable to reproduce this on a 3 node swarm running portainer 1.22.1 and agent 1.5.1. Hosts are running docker version 18.03.0-ce

I tested this as follows:

  1. Navigate to the cluster overview & set worker to drain mode
  2. Navigate to dashboard view & see 2 nodes in cluster + correct amount of resources are shown
  3. Navigate to cluster overview & set drained node to active in UI
  4. Navigate to dashboard view & see 3 nodes in cluster + correct amount of resources are shown

Let me know if there is anything you have done differently and I can try and reproduce this again

Very strange... I have this issue on 3 different clusters, all with completely open ports (no firewalls) and Portainer installed with the same deployment file. The only difference is that all have at least 6 nodes (3 managers plus extra workers), but it happens if I drain any of them. Several other things are working as expected, so I don't think it is a swarm setup or configuration issue. I don't know how to debug further.

@baskinsy can you share the logs of the agent with us? We might be able to identify the cause of the problem from here.

@deviantony I think the issue is happening only if you restart docker daemon or reboot the drained node. I'll check further as soon as I find the opportunity and report back with logs.

I solved our issue, and I now think it is a different one. My problem was a volume plugin spec pointing to a socket that no longer existed because we had removed the daemon. After removing the spec, Portainer runs fine. But I still think this should not break the whole agent.
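For anyone hitting the same thing: Docker discovers legacy volume plugins via files under /run/docker/plugins and /etc/docker/plugins, so a stale spec can be located and inspected like this (the plugin name is a placeholder):

```
# List legacy plugin discovery files and check where each spec points
ls /run/docker/plugins/ /etc/docker/plugins/ 2>/dev/null
cat /etc/docker/plugins/<plugin>.spec
```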

How did you remove it and resolve the issue? Could you provide the steps, please?

Activity on this issue seems to have slowed down after all the changes we made to make Portainer and the agent more stable. As such, I have moved discussion to the agent repo, as the problem that continues to be reported now is that the agent is not resilient enough, so the agent should be made more resilient.

If there are new issues that arise with endpoints in Portainer being unstable, we can re-open this issue. Otherwise, please report issues with the agent here

Thank you for your efforts on this issue.

Any update on this?

@cecchisandrone there's an update in my previous comment. Are you still experiencing endpoint instability?

Seems I solved by upgrading Docker version

I'm using the latest version now, but the problem still arises. Adding the endpoint succeeds, but accessing it fails.
