Nomad job using template stanza permanently dead after transient Consul errors

Created on 6 May 2017  ·  29 Comments  ·  Source: hashicorp/nomad

Nomad version

Output from nomad version

Nomad v0.5.6

Operating system and Environment details

Ubuntu 16.04 LTS

Issue

Every week my job reproducibly turns "dead":

ID            = scorecard
Name          = scorecard
Type          = service
Priority      = 50
Datacenters   = cbk1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         2       0         0        0       0         0

Allocations
No allocations placed

I am not sure yet what the cause is, which is why I am providing so much info and logs here.

I suspect it could be connected with the fact that all nodes in the nomad cluster reboot regularly, automatically.

Nomad cluster looks like this:

syseleven@nomad-servicehost1:~$ consul members
Node                 Address            Status  Type    Build  Protocol  DC
nomad-loadbalancer0  192.168.2.13:8301  alive   client  0.8.1  2         cbk1
nomad-node0          192.168.2.15:8301  alive   client  0.8.1  2         cbk1
nomad-node1          192.168.2.14:8301  alive   client  0.8.1  2         cbk1
nomad-servicehost0   192.168.2.11:8301  alive   server  0.8.1  2         cbk1
nomad-servicehost1   192.168.2.10:8301  alive   server  0.8.1  2         cbk1
nomad-servicehost2   192.168.2.12:8301  alive   server  0.8.1  2         cbk1

The servicehosts are Consul and Nomad servers; the rest are Consul and Nomad clients.

Reproduction steps

Not sure yet

Nomad Server & Client logs (if appropriate)

see attachment
scorecard.log.txt

Job file (if appropriate)

job "scorecard" {
  region      = "global"
  datacenters = ["cbk1"]
  priority    = 50

  type = "service"

  group "web" {
    count = 2

    task "scorecard-nginx" {
      driver = "docker"

      config {
        image = "nginx:latest"
        volumes = [
          "default.conf:/etc/nginx/conf.d/default.conf",
          "public_html:/var/www"
        ]
        port_map {
          http = 80
        }
      }

      service {
        name = "scorecard"
        port = "http"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 2000 # 2 GHz
        memory = 256 # 256 MB
        network {
          mbits = 10
          port "http" {
          }
        }
      }

      template {
        destination   = "default.conf"
        change_mode   = "restart"
        data          = <<EOH

                        server {
                            listen       80;
                            server_name  localhost;

                            error_page   500 502 503 504  /50x.html;
                            index index.html;
                            location / {
                                root   /var/www;
                            }
                            location = /50x.html {
                                root   /usr/share/nginx/html;
                            }
                        }
                        EOH
      }

      artifact {
        source = "{{ HTTPS_ARTIFACT_URL }}/scorecard/public_html.tgz"
        destination = "."
      }

      template {
        source        = "public_html/index.html.tpl"
        destination   = "public_html/index.html"
        change_mode   = "noop"
      }
    }
  }
}
Labels: stage/needs-investigation, theme/template, type/bug

Most helpful comment

Just to update, I would like to get this fixed in 0.6.1

All 29 comments

Have you filtered the logs in some way? It is very hard to tell what is happening.

@dadgar I am now monitoring that service to know the exact time it stops working. Next week I can then find out from the logs exactly what happened as it happens every week.

@dadgar OK, like clockwork it happened again :) The services are down again since 06:48:15 UTC

nomad-loadbalancer0:~# nomad status
ID                 Type     Priority  Status
gitlab-runner      service  100       running
hubot              service  50        running
nginx              service  50        dead
rocketchat         service  50        running
scorecard          service  50        dead
scorecard-counter  batch    50        running

nginx should be running on nomad-loadbalancer, but it is "dead"; also scorecard is not running.

A nomad stop {nginx,scorecard}; nomad run {nginx,scorecard}.nomad would now fix the problem.

From what I see there could have been transient network problems for a few minutes at that time. Probably consul returned 500s because it could not reach any server.

Some logs on nomad-loadbalancer0 (time is in utc):

May 14 06:42:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:20.140961 [DEBUG] http: Request /v1/agent/servers (202.961µs)
May 14 06:42:25 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:25.669056 [DEBUG] client: updated allocations at index 104937 (total 4) (pulled 0) (filtered 4)
May 14 06:42:25 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:25.669228 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 4)
May 14 06:42:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:30.142007 [DEBUG] http: Request /v1/agent/servers (28.333µs)
May 14 06:42:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:40.143386 [DEBUG] http: Request /v1/agent/servers (357.05µs)
May 14 06:42:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:50.145011 [DEBUG] http: Request /v1/agent/servers (27.24µs)
May 14 06:43:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:00.146532 [DEBUG] http: Request /v1/agent/servers (241.384µs)
May 14 06:43:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:10.147919 [DEBUG] http: Request /v1/agent/servers (226.812µs)
May 14 06:43:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:20.149039 [DEBUG] http: Request /v1/agent/servers (31.036µs)
May 14 06:43:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:30.150695 [DEBUG] http: Request /v1/agent/servers (26.596µs)
May 14 06:43:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:40.152461 [DEBUG] http: Request /v1/agent/servers (273.432µs)
May 14 06:43:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:50.154147 [DEBUG] http: Request /v1/agent/servers (210.807µs)
May 14 06:44:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:00.155406 [DEBUG] http: Request /v1/agent/servers (289.979µs)
May 14 06:44:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:10.156772 [DEBUG] http: Request /v1/agent/servers (37.66µs)
May 14 06:44:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:20.158156 [DEBUG] http: Request /v1/agent/servers (30.768µs)
May 14 06:44:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:30.159419 [DEBUG] http: Request /v1/agent/servers (220.105µs)
May 14 06:44:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:40.160583 [DEBUG] http: Request /v1/agent/servers (31.92µs)
May 14 06:44:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:50.162294 [DEBUG] http: Request /v1/agent/servers (257.235µs)
May 14 06:45:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:00.163968 [DEBUG] http: Request /v1/agent/servers (204.202µs)
May 14 06:45:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:10.165050 [DEBUG] http: Request /v1/agent/servers (27.475µs)
May 14 06:45:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:20.166408 [DEBUG] http: Request /v1/agent/servers (340.736µs)
May 14 06:45:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:30.168033 [DEBUG] http: Request /v1/agent/servers (28.915µs)
May 14 06:45:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:40.169261 [DEBUG] http: Request /v1/agent/servers (241.882µs)
May 14 06:45:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:50.170523 [DEBUG] http: Request /v1/agent/servers (243.546µs)
May 14 06:46:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:00.173718 [DEBUG] http: Request /v1/agent/servers (29.964µs)
May 14 06:46:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:10.175084 [DEBUG] http: Request /v1/agent/servers (247.54µs)
May 14 06:46:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:20.176312 [DEBUG] http: Request /v1/agent/servers (32.698µs)
May 14 06:46:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:30.180016 [DEBUG] http: Request /v1/agent/servers (2.268684ms)
May 14 06:46:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:40.184494 [DEBUG] http: Request /v1/agent/servers (33.153µs)
May 14 06:46:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:50.185996 [DEBUG] http: Request /v1/agent/servers (32.395µs)
May 14 06:47:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:00.187530 [DEBUG] http: Request /v1/agent/servers (37.275µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:10.189581 [DEBUG] http: Request /v1/agent/servers (38.058µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) health.service(scorecard|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/toilworkPercent): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/midonetCountdown): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/lastIncident): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/errorbudget): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:20.191015 [DEBUG] http: Request /v1/agent/servers (29.747µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:30.192322 [DEBUG] http: Request /v1/agent/servers (31.648µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:40.193297 [DEBUG] http: Request /v1/agent/servers (36.328µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:46.091343 [DEBUG] client: RPC failed to server 192.168.2.10:4647: rpc error: stream closed

Nomad did not log anything else on that host after 06:47:46 UTC

Any other ideas where to look for what went wrong?

I can confirm there was a transient network issue at that time and the virtual machines were probably isolated from each other for up to 10 minutes.

The nginx and scorecard services have in common that they use consul-template service discovery and/or key-value store functionality. None of the other services do that.

The Consul cluster is now healthy, by the way, but it wasn't at the time the service stopped. Expected behaviour is:

  1. if you start the service for the first time and consul is not available, nomad waits until consul is available, renders the templates and spawns the jobs.
  2. if the service is already running and consul-template has already rendered a file, and Consul becomes unreachable, nomad simply does nothing and leaves the service as-is, waiting for Consul to become available again (leaving the job with potentially stale configuration instead of killing it)

Also, I'd expect recovery once Consul returns to a functioning state, even if (2) is not the case.

This is a no-go for production IMO

@dadgar sorry for my last comment, not very helpful of me to complain .... do you have enough info to understand what's going on?

@stefreak Thanks for the report. I understand what is happening now. We have a default retry rate that is not exposed. For 0.6.0 I would like to expose the retry rate so it can be extended and optionally set to retry indefinitely.
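
As a purely hypothetical sketch, exposing that retry rate in the client agent configuration could look something like the following; the block and field names here are assumptions for illustration only, not an option that exists in Nomad 0.5.6:

# HYPOTHETICAL: illustrates the kind of knob being discussed,
# not an existing Nomad 0.5.6 configuration option.
client {
  template {
    consul_retry {
      attempts    = 0       # assumed semantics: 0 = retry indefinitely
      backoff     = "250ms" # initial wait between retries
      max_backoff = "1m"    # cap for the exponential backoff
    }
  }
}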

OK, thanks, makes sense.

What I don't understand is why it's not the default to retry indefinitely, maybe with a maximum retry interval like 5 minutes? IMO transient errors should not result in permanent loss of service ...

Also, failing to contact consul should not kill the job – it does not if I use consul_template directly.

The defaults of consul-template, together with a traditional setup like systemd's Restart=on-failure, lead to indefinite retries with an exponential backoff capped at 8 seconds, which is pretty reasonable IMO.
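
For reference, a minimal sketch of how standalone consul-template exposes these retry controls in more recent releases; the field names and semantics should be verified against the consul-template version in use, and the values shown are only illustrative:

# consul-template configuration sketch (not a Nomad file); values are examples.
consul {
  address = "127.0.0.1:8500"

  retry {
    enabled     = true
    attempts    = 0       # assumed: 0 means retry indefinitely in newer releases
    backoff     = "250ms" # initial wait, doubled after each failed attempt
    max_backoff = "1m"    # cap on the exponential backoff
  }
}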

If a Nomad job fails for any reason, Nomad (at least with restart mode "delay") will also keep trying to restart it forever; that is the default, and IMO it should also be the default for the template stanza.
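
For comparison, a restart stanza at the task group level looks like this; the values are illustrative, not necessarily the defaults of any particular Nomad version:

restart {
  attempts = 2       # restarts allowed within the interval
  interval = "5m"    # sliding window in which attempts are counted
  delay    = "15s"   # wait before restarting a failed task
  mode     = "delay" # once attempts are exhausted, wait another interval and keep trying
}

With mode = "delay", a failing task is never given up on permanently, which is the behaviour being argued for above for template rendering failures as well.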

@dadgar exposing retry will not be enough to fix all instances of this; I have now encountered another instance of this issue with my patched version of Nomad (which retries forever).

For some reason it still stopped retrying...

I think I had not seen the message "yamux: keepalive failed: i/o deadline reached" before.

(background is the VM had stalled CPUs and I/O issues because of a hypervisor problem)

May 28 06:47:18 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:18.348451 [DEBUG] client: RPC failed to server 192.168.2.11:4647: rpc error: No cluster leader
May 28 06:47:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:23.376923 [DEBUG] client: RPC failed to server 192.168.2.12:4647: rpc error: No cluster leader
May 28 06:47:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:23.575040 [DEBUG] http: Request /v1/agent/servers (255.197µs)
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:28.391676 [DEBUG] client: RPC failed to server 192.168.2.10:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:28.391723 [ERR] client: heartbeating failed. Retrying in 18.876272335s: failed to update status: 3 error(s) occurred:
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]: * RPC failed to server 192.168.2.11:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]: * RPC failed to server 192.168.2.12:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]: * RPC failed to server 192.168.2.10:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:28.393873 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["cbk1"]
May 28 06:47:33 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:33.576862 [DEBUG] http: Request /v1/agent/servers (24.23µs)
May 28 06:47:43 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:43.580115 [DEBUG] http: Request /v1/agent/servers (1.700357ms)
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:50 [ERR] yamux: keepalive failed: i/o deadline reached
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:50.918163 [ERR] client.consul: error discovering nomad servers: 3 error(s) occurred:
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]: * rpc error: No cluster leader
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]: * rpc error: No cluster leader
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]: * rpc error: EOF
May 28 06:47:53 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:53.581774 [DEBUG] http: Request /v1/agent/servers (183.909µs)
May 28 06:47:59 nomad-loadbalancer0 nomad[6984]: 2017/05/28 06:47:59 [ERR] yamux: keepalive failed: i/o deadline reached
May 28 06:48:00 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:00 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 2 after "500ms")
May 28 06:48:01 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:01 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: rpc error: stream closed) (retry attemp
May 28 06:48:03 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:03.583229 [DEBUG] http: Request /v1/agent/servers (32.213µs)
May 28 06:48:09 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:09 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: No cluster leader) (retry attempt 4 aft
May 28 06:48:13 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:13.584402 [DEBUG] http: Request /v1/agent/servers (25.852µs)
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23 [ERR] yamux: keepalive failed: i/o deadline reached
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23.164905 [DEBUG] client: RPC failed to server 192.168.2.11:4647: rpc error: EOF
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23.164933 [DEBUG] client: RPC failed to server 192.168.2.11:4647: rpc error: EOF
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23.585810 [DEBUG] http: Request /v1/agent/servers (179.973µs)
May 28 06:48:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:28.174336 [DEBUG] client: RPC failed to server 192.168.2.12:4647: rpc error: No cluster leader
May 28 06:48:29 nomad-loadbalancer0 nomad[6984]: 2017/05/28 06:48:29 [ERR] yamux: keepalive failed: i/o deadline reached
lines 956-1001/1001 (END)

@dadgar this time restarting Nomad resolved the issue. What can I do to get more debug info next time?

@stefreak Your Consul and Nomad cluster lost quorum. The retry behavior would also have covered this:

May 28 06:48:00 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:00 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 2 after "500ms")

Restarting Nomad was just a coincidence; the problem with the template was that the Consul queries were failing because of the lack of quorum.

@dadgar I think this is not covered, because this was using my patched version of nomad with unlimited retry on all servers and clients: https://github.com/stefreak/nomad/tree/v0.5.6-patched

@stefreak At least in the logs provided, it does not appear that consul-template gave up on retrying; it was just that it could not render because Consul did not have a leader.

@dadgar I captured the logs one day later and there were no more log entries (lines 956-1001/1001 (END))

So somehow nomad locked up and the job was down, I think for another reason.

Not sure if I can reproduce this behaviour, though; if I have a lot of time some day I will try by simulating very slow / locked-up storage.

I believe that I've run into this issue as well. Nomad 0.5.6. Consul 0.8.5.

I updated my test cluster yesterday to Consul 0.8.5 and noticed my service jobs died around that same time. The previous allocations have all been GCd this morning, so I just have the remote logs to go on.

The logs show attempts with retries to the health and KV APIs all reaching maximum retries at/around 5s:

2017/07/05 21:27:35 [ERR] (view) health.service(foo-php-fpm|passing): Get http://169.254.1.1:8500/v1/health/service/foo-php-fpm?index=30674436&passing=1&stale=&wait=60000ms: dial tcp 169.254.1.1:8500: getsockopt: connection refused (exceeded maximum retries)
....
2017/07/05 21:27:31 [WARN] (view) health.service(foo-php-fpm|passing): Get http://169.254.1.1:8500/v1/health/service/foo-php-fpm?index=30674436&passing=1&stale=&wait=60000ms: dial tcp 169.254.1.1:8500: getsockopt: connection refused (retry attempt 5 after "4s")

We regularly upgrade Consul to each point release. We've had these service jobs end up dead like this in the past, but haven't correlated it to the consul agent restart. I'll watch for it more closely on future upgrades.

This is also a test cluster with smaller Consul and Nomad instances and the baseline Raft performance settings for those smaller systems. The larger production systems may be able to restart the local Consul agent within the ~5 seconds' worth of retries currently available.

I just did some tests for this on a nomad worker in my test cluster and can reproduce the issue.

With 2 task groups of the same application running on 2 different workers, introduce consul agent availability issues on one of the workers....

Test 1: Restart the local consul agent (using systemd "restart")
This was handled with no issues reported and both task groups remained healthy.

Test 2: Stop the local consul agent, wait ~7s, start the local consul agent
This resulted in the tasks failing on the worker node.
Logs show similar patterns of retries through approximately 5s and then failure.

This job is a "service" job and does not currently specify a "restart" stanza. I'm looking at the restart stanza docs and I think the default restart policy should apply here.

I tried adding a "restart" stanza to the task groups in my test job, but that doesn't change the behavior.

The "Test 2" scenario still fails tasks every time. I've also noticed that the "Test 1" scenario will sometimes fail tasks depending on how long it takes the consul agent to come back up.

@stefreak @dadgar Let me know if there's anything else I can test or repro for this.

Having the consul-template retry controls available would be great for tuning certain jobs, but I'd also like to understand if nomad should be restarting these tasks once the local consul agent comes up again.

Should the Nomad agent's overall Consul retry options be configurable (more than 5 retries / 5 seconds, etc.)?

Should the tasks be killed in the first place because the consul agent is down? Mine use the docker driver, if it matters.

@stevenscg thanks for reproducing it.

The issue is clear, and it's also clear how to fix it. I wanted to do it originally, but I have not found the time yet.

If anyone wants to do it, feel free. Discussion on how to do it can be found here:
https://github.com/hashicorp/nomad/pull/2665

Updated Consul agents on our prod cluster and ran into this again even with the bigger/faster instances. Was expecting it, so quickly re-deployed all the jobs.

Just to update, I would like to get this fixed in 0.6.1

@dadgar does that mean you are working on it right now? Just asking so we don't duplicate the effort

@stefreak Nope haven't started yet. Busy with polishing 0.6.0 and have other things I will be personally tackling in 0.6.1

Hit this issue, but with slightly different behaviour. I have a couple of tree | explode calls to Consul keys to render a template. As soon as I try to add a new key that has subkeys expected by the template, the allocations fail to render the template, and even with change_mode = "noop" all the allocations become DEAD; only resubmitting the job can make them run again. I'd expect Nomad to stop / restart the allocation only if the template renders successfully.

Worth mentioning that I'm on Nomad v0.8.0.
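
For illustration, a template stanza along the lines described above might look like the sketch below; the key prefix and field names are made up. If a subtree the template dereferences (here, "timeouts") does not yet exist under the prefix, rendering fails:

template {
  destination = "local/app.conf"
  change_mode = "noop"
  data        = <<EOH
{{ with tree "myapp/config" | explode }}
endpoint = {{ .endpoint }}
timeout  = {{ .timeouts.read }}
{{ end }}
EOH
}

Guarding individual lookups, for example with keyOrDefault, can avoid the render error itself, but it does not change the behaviour complained about here, namely that a failed render takes down the allocations even with change_mode = "noop".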

Any updates on this? I had a full Nomad outage because we lost Consul leadership, and all the running jobs entered the failed state because Nomad was unable to read from the Consul K/V to render the templates.

It would be nice to have a way to tell Nomad to do nothing to running allocations if template rendering fails.
