template: continue monitoring runner even with change_mode=noop #28016
Open
tgross wants to merge 1 commit into
If all the templates in a task have `change_mode = "noop"`, the task template manager returns after the initial rendering, under the assumption that we no longer need to monitor the template runner. But if a template with a Consul or Vault dependency loses contact with the upstream service for long enough for the `client.template.consul_retry` or `client.template.vault_retry` to expire, the template runner exits. If a task is monitoring its own template contents and not relying on `change_mode`, the task ends up with stale content, and it's up to the application to handle that stale content safely.

But this presents a problem when connectivity is restored, because the template runner has already exited and will never be restarted. So an application that may have been able to handle stale content for the configured `consul_retry` or `vault_retry` duration will no longer get template updates at all, rather than being killed and restarted. This turns a "loud" outage that's visible to Nomad into a "silent" outage that's only visible to the application, which is generally a bad situation.

Drop the optimization that stops the monitoring of the template runner.

Fixes: https://hashicorp.atlassian.net/browse/NMD-1487
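The difference between the two behaviors can be illustrated with a small Go sketch. This is not Nomad's actual code: the `runner` type, the `monitor` and `simulate` functions, and the channel names are hypothetical, and the error is a made-up stand-in for the runner exiting once its retries are used up. The sketch only shows why returning after the first render means a later runner exit goes unobserved.

```go
package main

import (
	"errors"
	"fmt"
)

// runner is a hypothetical stand-in for the template runner. It signals the
// initial render on renderedCh and reports a fatal exit on errCh.
type runner struct {
	renderedCh chan struct{}
	errCh      chan error
}

// monitor waits for the initial render. If stopAfterRender is true it returns
// immediately afterwards (the dropped optimization); otherwise it keeps
// watching the runner and returns the runner's fatal error, if any.
func monitor(r *runner, done <-chan struct{}, stopAfterRender bool) error {
	select {
	case <-r.renderedCh:
	case <-done:
		return nil
	}
	if stopAfterRender {
		// Assumed old behavior: every template is noop, so stop watching.
		return nil
	}
	select {
	case err := <-r.errCh:
		// A caller could use this error to fail the task loudly.
		return err
	case <-done:
		return nil
	}
}

// simulate queues an initial render followed by a runner exit and reports
// whether monitor observed the exit.
func simulate(stopAfterRender bool) bool {
	r := &runner{
		renderedCh: make(chan struct{}, 1),
		errCh:      make(chan error, 1),
	}
	done := make(chan struct{}) // never closed in this sketch
	r.renderedCh <- struct{}{}
	r.errCh <- errors.New("retry attempts exhausted (hypothetical error)")
	return monitor(r, done, stopAfterRender) != nil
}

func main() {
	fmt.Println("stop after render, exit observed:", simulate(true))
	fmt.Println("keep monitoring, exit observed:", simulate(false))
}
```

In the sketch, only the variant that keeps monitoring sees the runner exit, which is the signal needed to kill the task instead of leaving it running with stale template content.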
schmichael
approved these changes
May 20, 2026
Testing & Reproduction steps
In addition to the unit test I've added here, which fails without the patch, you can reproduce this with a single Consul and Nomad node. (Note that the same issue can appear with Vault too, but Consul is a little easier to set up.)
Start Consul and Nomad. Nomad can be in dev mode, but Consul should be in normal mode so that you can restart it. Use the following in the Nomad client `template` configuration:
Add data to Consul KV: `consul kv put "nomad/name" Tim`.
After configuring Consul and Nomad ACLs so that Nomad tasks can read from the `nomad/` KV prefix, run the following job:
jobspec
You can curl the allocation's port to see the KV data:
Then stop Consul and wait to see:
Then restart Consul and update the KV: `consul kv put "nomad/name" Austin`. Without this patch, the new value will not appear if you curl the allocation again.
With this patch, the task is killed instead.
Contributor Checklist
changelog entry using the `make cl` command.
ensure regressions will be caught.
`template.change_mode = "noop"` vs `template.once = true` which comes into play here
Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.