What steps did you take:
- Run kapp-controller on a cluster with 250+ nodes.
- Deploy an App CR that manages a large number of resources (e.g. DaemonSets or cluster-wide resources that produce one change entry per node in kapp's output).
- Observe that the App enters a ReconcileFailed loop and takes ~40 minutes to eventually show Reconcile succeeded.
What happened:
When kapp deploys a large number of resources it produces verbose stdout — on a 250-node cluster this can reach several MiB. kapp-controller writes this verbatim into status.deploy.stdout and then calls the Kubernetes API to persist the status. etcd rejects objects larger than its hard 2 MiB gRPC message limit, causing the UpdateStatus call to fail.
The deployment itself has succeeded; only the status write fails. On the next reconcile cycle kapp produces a much smaller "no-op" diff that fits within the limit, so the status update eventually goes through — but only after repeated failed retries, resulting in a ~40-minute apparent delay.
What did you expect:
status.deploy.stdout (and other output fields such as status.fetch.stdout, status.inspect.stdout, and status.usefulErrorMessage) should be bounded so that the status object never exceeds etcd's 2 MiB limit, regardless of cluster size. When output is clipped, the field should begin with a clear truncation notice (e.g. [output truncated]\n) so operators immediately know the field is incomplete, rather than being truncated silently.
Anything else you would like to add:
The issue is non-fatal — reconciliation will eventually succeed — but the delay is significant in production (observed: ~40 minutes) and can mask real problems if operators treat ReconcileFailed as a signal to investigate.
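A possible fix is to truncate individual output fields to a reasonable limit (e.g. 1 MiB per field) before writing them into the status struct. Because several fields can be large at once, the per-field limit should be chosen so that their combined size still fits under the 2 MiB limit. Keeping the tail of the output is preferable because the most actionable content (final resource summary, error lines) usually appears at the end of kapp output.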
Other status string fields that may also grow large in failure scenarios: status.deploy.stderr, status.fetch.stderr, and status.usefulErrorMessage.
Environment:
- kapp Controller version (execute kubectl get deployment -n kapp-controller kapp-controller -o yaml and the annotation is kbld.k14s.io/images): any version
- Kubernetes version (use kubectl version): any version; effect is most pronounced on 250+ node clusters
Vote on this request
This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.
👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"
We are also happy to receive and review Pull Requests if you want to help working on this issue.