template: continue monitoring runner even with change_mode=noop#28016

Open
tgross wants to merge 1 commit into main from NMD1487-fatal-error-template-noop

@tgross tgross commented May 20, 2026

If all the templates in a task have change_mode = "noop", the task template manager returns after the initial rendering, on the assumption that we no longer need to monitor the template runner. But if a template with a Consul or Vault dependency loses contact with the upstream service for long enough that the client.template.consul_retry or client.template.vault_retry attempts are exhausted, the template runner exits. If a task is monitoring its own template contents and not relying on change_mode, the task ends up with stale content, and it's up to the application to handle that stale content safely.

But this presents a problem when connectivity is restored, because by then the template runner has exited and will never be restarted. So an application that may have been able to handle stale content for the configured consul_retry or vault_retry duration will no longer get template updates at all, rather than being killed and restarted. This turns a "loud" outage that's visible to Nomad into a "silent" outage that's only visible to the application, which is generally a bad situation.

Drop the optimization that stops the monitoring of the template runner.
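
To make the failure mode concrete, here is a toy model in Go. It is a sketch under stated assumptions, not the actual Nomad code: fakeRunner, manage, and simulate are hypothetical names, and the real template manager does considerably more. It only shows why returning after the first render means a later fatal runner error is never seen.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeRunner is a hypothetical stand-in for a template runner: it signals
// once when the first render is done and reports a fatal error on errCh
// when its retries are exhausted.
type fakeRunner struct {
	rendered chan struct{}
	errCh    chan error
}

// manage models the template manager loop. With stopEarly set it mimics the
// dropped optimization: if every template is noop it returns after the first
// render, so nothing is listening when the runner later fails.
func manage(r *fakeRunner, allNoop, stopEarly bool, kill func(error)) {
	<-r.rendered // wait for the initial render
	if allNoop && stopEarly {
		return // old behavior: errCh is never read again
	}
	// Patched behavior: keep watching so a fatal error fails the task loudly.
	if err := <-r.errCh; err != nil {
		kill(err)
	}
}

// simulate runs one upstream outage and reports whether the task was killed.
func simulate(stopEarly bool) bool {
	r := &fakeRunner{rendered: make(chan struct{}), errCh: make(chan error, 1)}
	killed := false
	done := make(chan struct{})
	go func() {
		defer close(done)
		manage(r, true, stopEarly, func(error) { killed = true })
	}()
	close(r.rendered)                          // initial render succeeds
	r.errCh <- errors.New("retries exhausted") // outage outlasts the retries
	<-done
	return killed
}

func main() {
	fmt.Println("with optimization, task killed:", simulate(true))
	fmt.Println("without optimization, task killed:", simulate(false))
}
```

In this model the error sent while the optimization is active sits unread in the channel, which is the "silent" outage described above.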

Fixes: https://hashicorp.atlassian.net/browse/NMD-1487

Testing & Reproduction steps

In addition to the unit test I've added here, which fails without the patch, you can reproduce this with a single Consul and Nomad node. (Note that the same issue can appear with Vault too, but this is a little easier to set up.)

Start Consul and Nomad. Nomad can be in dev mode but Consul should be in normal mode so that you can restart it. Use the following in the Nomad client template configuration:

client {
  template {
    consul_retry {
      backoff     = "50ms"
      attempts    = 2
      max_backoff = "5s"
    }
  }
}

Add data to Consul KV: consul kv put "nomad/name" Tim. After configuring Consul and Nomad ACLs so that Nomad tasks can read from the nomad/ KV prefix, run the following job:

jobspec
job "example" {

  group "web" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      consul {}

      template {
        data = <<EOT
<html>
  <div>hello, {{key "nomad/name"}}</div>
</html>
        EOT

        destination = "${NOMAD_TASK_DIR}/index.html"
        change_mode = "noop"
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

You can curl the allocation's port to see the KV data:

$ curl  192.168.1.194:23495
<html>
  <div>hello, Tim</div>
</html>

Then stop Consul and wait until the agent logs show the retries being exhausted:

2026-05-20T10:13:00.390-0400 [WARN] agent: (view) kv.block(nomad/name): Get "http://localhost:8500/v1/kv/nomad/name?index=72&stale=&wait=300000ms": dial tcp [::1]:8500: connect: connection refused (retry attempt 1 after "50ms")
2026-05-20T10:13:00.390-0400 [ERROR] agent: (runner) sending server error back to caller
2026-05-20T10:13:00.441-0400 [WARN] agent: (view) kv.block(nomad/name): Get "http://localhost:8500/v1/kv/nomad/name?index=72&stale=&wait=300000ms": dial tcp [::1]:8500: connect: connection refused (retry attempt 2 after "100ms")

Then restart Consul and update the KV: consul kv put "nomad/name" Austin. If you curl the allocation again, the response still shows the old value, because the template runner has exited and the template is never re-rendered.

With this patch, the task is killed instead.

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation Strictly speaking, this doesn't need docs, but I think I'll open a PR explaining the difference between template.change_mode = "noop" and template.once = true, which comes into play here.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@tgross tgross added the theme/template, type/bug, backport/ent/1.10.x+ent, backport/ent/1.11.x+ent, and backport/2.0.x labels May 20, 2026
@tgross tgross force-pushed the NMD1487-fatal-error-template-noop branch from 876b7e2 to 5ec41b4 Compare May 20, 2026 15:09
@tgross tgross force-pushed the NMD1487-fatal-error-template-noop branch from 5ec41b4 to 9cd1469 Compare May 20, 2026 15:18
@tgross tgross added this to the 2.0.x milestone May 20, 2026
@tgross tgross marked this pull request as ready for review May 20, 2026 17:38
@tgross tgross requested review from a team as code owners May 20, 2026 17:38
@tgross tgross requested review from gulducat, pkazmierczak and tehut May 20, 2026 17:38