template: continue monitoring runner even with change_mode=noop by tgross · Pull Request #28016 · hashicorp/nomad

tgross · 2026-05-20T15:06:36Z

If all the templates in a task have `change_mode = "noop"`, the task template manager returns after the initial rendering, on the assumption that it no longer needs to monitor the template runner. But if a template with a Consul or Vault dependency loses contact with the upstream service for long enough that the `client.template.consul_retry` or `client.template.vault_retry` attempts are exhausted, the template runner exits. If a task is monitoring its own template contents and not relying on `change_mode`, the task ends up with stale content, and it's up to the application to handle that stale content safely.

But this presents a problem when connectivity is restored, because the template runner has exited and will never be restarted. So an application that may have been able to handle stale content for the configured `consul_retry` or `vault_retry` duration will no longer get template updates, rather than being killed and restarted. This turns a "loud" outage that's visible to Nomad into a "silent" outage that's only visible to the application, which is generally a bad situation.

Drop the optimization that stops the monitoring of the template runner.

Fixes: https://hashicorp.atlassian.net/browse/NMD-1487

Testing & Reproduction steps

In addition to the unit test I've added here, which fails without the patch, you can reproduce this with a single Consul and Nomad node. (Note that the same issue can appear with Vault too, but Consul is a little easier to set up.)

Start Consul and Nomad. Nomad can be in dev mode but Consul should be in normal mode so that you can restart it. Use the following in the Nomad client template configuration:

client {
  template {
    consul_retry {
      backoff     = "50ms"
      attempts    = 2
      max_backoff = "5s"
    }
  }
}

Add data to Consul KV: `consul kv put "nomad/name" Tim`. After configuring Consul and Nomad ACLs so that Nomad tasks can read from the `nomad/` KV prefix, run the following job:

jobspec

job "example" {

  group "web" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      consul {}

      template {
        data = <<EOT
<html>
  <div>hello, {{key "nomad/name"}}</div>
</html>
        EOT


        destination = "${NOMAD_TASK_DIR}/index.html"
        change_mode = "noop"
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

You can curl the allocation's port to see the KV data:

$ curl  192.168.1.194:23495
<html>
  <div>hello, Tim</div>
</html>

Then stop Consul and wait to see:

2026-05-20T10:13:00.390-0400 [WARN] agent: (view) kv.block(nomad/name): Get "http://localhost:8500/v1/kv/nomad/name?index=72&stale=&wait=300000ms": dial tcp [::1]:8500: connect: connection refused (retry attempt 1 after "50ms")
2026-05-20T10:13:00.390-0400 [ERROR] agent: (runner) sending server error back to caller
2026-05-20T10:13:00.441-0400 [WARN] agent: (view) kv.block(nomad/name): Get "http://localhost:8500/v1/kv/nomad/name?index=72&stale=&wait=300000ms": dial tcp [::1]:8500: connect: connection refused (retry attempt 2 after "100ms")

Then restart Consul and update the KV: `consul kv put "nomad/name" Austin`. If you curl the allocation again, the response still shows the old value, because the template is never re-rendered.

With this patch, the task is killed instead.

Contributor Checklist

Changelog Entry: If this PR changes user-facing behavior, please generate and add a changelog entry using the make cl command.
Testing: Please add tests to cover any new functionality or to demonstrate bug fixes and ensure regressions will be caught.
Documentation: Strictly speaking this doesn't need docs, but I think I'll open a PR explaining the difference between template.change_mode = "noop" and template.once = true, which comes into play here.

Reviewer Checklist

Backport Labels: Please add the correct backport labels as described by the internal backporting document.
Commit Type: Ensure the correct merge method is selected, which should be "squash and merge" in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
Enterprise PRs: If this is an enterprise-only PR, please add any required changelog entry within the public repository.

If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

tgross added the theme/template, type/bug, backport/ent/1.10.x+ent, backport/ent/1.11.x+ent, and backport/2.0.x labels May 20, 2026

vercel Bot deployed to Preview May 20, 2026 15:07 View deployment

tgross force-pushed the NMD1487-fatal-error-template-noop branch from 876b7e2 to 5ec41b4 Compare May 20, 2026 15:09

vercel Bot deployed to Preview May 20, 2026 15:10 View deployment

tgross force-pushed the NMD1487-fatal-error-template-noop branch from 5ec41b4 to 9cd1469 Compare May 20, 2026 15:18

vercel Bot deployed to Preview May 20, 2026 15:19 View deployment

tgross added this to the 2.0.x milestone May 20, 2026

tgross marked this pull request as ready for review May 20, 2026 17:38

tgross requested review from a team as code owners May 20, 2026 17:38

tgross requested review from gulducat, pkazmierczak and tehut May 20, 2026 17:38

schmichael approved these changes May 20, 2026

tgross wants to merge 1 commit into main from NMD1487-fatal-error-template-noop
