Bug description
tanzu management-cluster create and delete fail with a timeout because tanzu-addons-controller-manager is in a crash loop,
caused by its inability to reach kapp-controller's API server for data.packaging.carvel.dev/v1alpha1.
We have seen a few cases where tanzu management-cluster create and delete fail in the following way:
- tanzu management-cluster create/delete times out
- tanzu-addons-controller-manager is not fully running (get pods shows 0/n pods running)
- logs for tanzu-addons-controller-manager show errors about data.packaging.carvel.dev/v1alpha1:
I0901 23:06:15.856616 1 request.go:645] Throttling request took 1.007519759s, request: GET:https://[fd00:100:96::1]:443/apis/data.packaging.carvel.dev/v1alpha1?timeout=32s
E0901 23:06:19.163153 1 addon_controller.go:188] controllers/Addon "msg"="error retrieving GroupVersion" "error"="the server is currently unable to handle the request" "GroupVersion"="data.packaging.carvel.dev/v1alpha1"
*This issue looks similar to #571*
### Workaround:
- Restarting the kapp-controller and tkr-validator pods allowed the create/delete operations to go through.
kapp-controller-6d864cc846-kxgdc 2/2 Running 0 3m38s
tkr-conversion-webhook-manager-774f74f64c-f2ngr 1/1 Running 0 78s
- This also helped get the tanzu-addons-controller-manager back to Running state.
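The restart described above could be scripted roughly as follows. This is a minimal sketch, not a verified procedure: the deployment names are inferred from the pod names in the listing (kapp-controller, tkr-conversion-webhook-manager), the vmware-system-tkg namespace is taken from the webhook service URL in the logs below, and the kapp-controller namespace (tkg-system here) is an assumption that may differ in your environment. The KUBECTL variable is a hypothetical hook so the commands can be previewed without touching a cluster.

```shell
# Sketch of the workaround: restart the two deployments via a rolling restart.
# KUBECTL can be overridden for a dry run, e.g. KUBECTL="echo kubectl".
KUBECTL="${KUBECTL:-kubectl}"

# ASSUMPTION: namespaces are not stated in the report; adjust as needed.
KAPP_NS="${KAPP_NS:-tkg-system}"
TKR_NS="${TKR_NS:-vmware-system-tkg}"

restart_workaround() {
  # Deployment names are inferred from the pod names shown in the listing.
  $KUBECTL -n "$KAPP_NS" rollout restart deployment/kapp-controller
  $KUBECTL -n "$TKR_NS" rollout restart deployment/tkr-conversion-webhook-manager
}
```

Calling restart_workaround with KUBECTL="echo kubectl" prints the two commands instead of running them, which is a cheap way to confirm the namespaces before applying the restart.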
Affected product area (please put an X in all that apply)
- ( ) APIs
- (X) Addons
- ( ) CLI
- ( ) Docs
- ( ) IAM
- ( ) Installation
- ( ) Plugin
- ( ) Security
- (X) Test and Release
- ( ) User Experience
- ( ) Developer Experience
Expected behavior
tanzu management-cluster create/delete succeeds eventually.
Steps to reproduce the bug
The issue is intermittent, so no specific steps are available to reproduce the issue.
Version (include the SHA if the version is not obvious)
kapp-controller: v0.41.2
Environment where the bug was observed (cloud, OS, etc)
VMC on Nitros
Relevant Debug Output (Logs, manifests, etc)
- kubectl get packagerepositories.packaging.carvel.dev utkg-packages-repo -n vmware-system-pkgs -o yaml | less
usefulErrorMessage: |-
I0316 20:55:53.701753 38347 request.go:601] Waited for 1.031315546s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/addons.cluster.x-k8s.io/v1alpha3
I0316 20:56:04.901721 38347 request.go:601] Waited for 1.020640753s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/netoperator.vmware.com/v1alpha1
I0316 20:56:16.368738 38347 request.go:601] Waited for 1.023073347s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/infrastructure.cluster.vmware.com/v1beta1
I0316 20:56:27.834792 38347 request.go:601] Waited for 1.019206685s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/infrastructure.cluster.vmware.com/v1alpha3
I0316 20:56:39.302278 38347 request.go:601] Waited for 1.005437717s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/crd.projectcalico.org/v1
kapp: Error: unable to retrieve the complete list of server APIs: data.packaging.carvel.dev/v1alpha1: the server is currently unable to handle the request (possibly related issue: https://github.com/vmware-tanzu/carvel-kapp/issues/12)
- addons controller
I0316 21:29:23.950689 1 logr.go:252] clusterbootstrap-resource "msg"="validate create" "name"="v1.23.15---vmware.1-tkg.4"
W0316 21:29:29.030873 1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.5/tools/cache/reflector.go:167: failed to list *v1alpha1.TanzuKubernetesRelease: conversion webhook for run.tanzu.vmware.com/v1alpha3, Kind=TanzuKubernetesRelease failed: Post "https://tkr-conversion-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s": dial tcp 172.24.252.7:443: connect: connection refused
E0316 21:29:29.030907 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.5/tools/cache/reflector.go:167: Failed to watch *v1alpha1.TanzuKubernetesRelease: failed to list *v1alpha1.TanzuKubernetesRelease: conversion webhook for run.tanzu.vmware.com/v1alpha3, Kind=TanzuKubernetesRelease failed: Post "https://tkr-conversion-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s": dial tcp 172.24.252.7:443: connect: connection refused
- tkr-conversion webhook
2023/03/14 20:06:50 http: TLS handshake error from 10.73.129.217:57968: EOF
2023/03/14 20:06:50 http: TLS handshake error from 10.73.129.216:53886: EOF
2023/03/14 20:08:09 http: TLS handshake error from 10.73.129.217:41790: EOF
2023/03/14 20:08:09 http: TLS handshake error from 10.73.129.216:35878: EOF
- packages CR is missing from the environment
kubectl get packages
error: the server doesn't have a resource type "packages"
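The packages resource type is served by kapp-controller's aggregated API (data.packaging.carvel.dev), so its absence is consistent with that API being unavailable. One way to check is to read the Available condition of the corresponding APIService. The following is a sketch under assumptions: the APIService name v1alpha1.data.packaging.carvel.dev is the usual one but has not been confirmed for this environment, and KUBECTL is a hypothetical override for dry runs.

```shell
# Sketch: report whether the aggregated packaging API is marked Available.
# KUBECTL can be overridden for a dry run, e.g. KUBECTL="echo kubectl".
KUBECTL="${KUBECTL:-kubectl}"

# ASSUMPTION: the APIService is registered under this name.
APISERVICE="${APISERVICE:-v1alpha1.data.packaging.carvel.dev}"

packaging_api_available() {
  # Prints the status of the "Available" condition for the APIService.
  $KUBECTL get apiservice "$APISERVICE" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}
```

If packaging_api_available reports anything other than True while kapp-controller is running, that would point at the connection between the Kubernetes API server and kapp-controller rather than at the addons controller itself.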