This repository was archived by the owner on Oct 10, 2023. It is now read-only.

kapp-controller api-service is unable to handle requests causing tanzu management-cluster create failures #4507

@ridaz

Bug description

tanzu management-cluster create and delete fail with a timeout because tanzu-addons-controller-manager
crash-loops when it cannot reach kapp-controller's apiserver for data.packaging.carvel.dev/v1alpha1.

We have seen a few cases where tanzu management-cluster create and delete fail in the following way:

  1. tanzu management-cluster create/delete times out
  2. tanzu-addons-controller-manager is not fully running (get pods shows 0/n pods running)
  3. logs for tanzu-addons-controller-manager show errors about data.packaging.carvel.dev/v1alpha1:
I0901 23:06:15.856616       1 request.go:645] Throttling request took 1.007519759s, request: GET:https://[fd00:100:96::1]:443/apis/data.packaging.carvel.dev/v1alpha1?timeout=32s
E0901 23:06:19.163153       1 addon_controller.go:188] controllers/Addon "msg"="error retrieving GroupVersion" "error"="the server is currently unable to handle the request"  "GroupVersion"="data.packaging.carvel.dev/v1alpha1"
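
To confirm which API groups the controller cannot discover, the log lines above can be filtered mechanically. The following is a minimal sketch, not part of the original report: the function name `unavailable_group_versions` is hypothetical, and it assumes the log format shown above, where the failing group appears in a `"GroupVersion"="..."` field.

```shell
# Reads controller log lines on stdin and prints each distinct GroupVersion
# that appears in an "error retrieving GroupVersion" message.
# Hypothetical helper; it only parses text and does not contact the cluster.
unavailable_group_versions() {
  sed -n 's/.*error retrieving GroupVersion.*"GroupVersion"="\([^"]*\)".*/\1/p' | sort -u
}

# Example usage (the Deployment name is inferred from the pod name, and the
# namespace is a placeholder for wherever the controller runs):
#   kubectl logs deployment/tanzu-addons-controller-manager -n <namespace> \
#     | unavailable_group_versions
```

If the only group printed is data.packaging.carvel.dev/v1alpha1, the failure is likely confined to the aggregated API served by kapp-controller rather than to the cluster's API server as a whole.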

*This issue looks similar to #571.*

### Workaround:

  1. Restarting the kapp-controller and tkr-validator pods allowed the create/delete operations to go through:
kapp-controller-6d864cc846-kxgdc                    2/2     Running   0          3m38s
tkr-conversion-webhook-manager-774f74f64c-f2ngr     1/1     Running   0          78s
  2. This also brought tanzu-addons-controller-manager back to the Running state.
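
The restart described above can be scripted. This is a sketch under stated assumptions, not the exact commands used in the report: `restart_and_wait` is a hypothetical helper, the Deployment names are inferred from the pod names in the listing, and the namespaces are placeholders because the report does not state where kapp-controller runs.

```shell
# Restarts one Deployment and blocks until the rollout finishes or times out.
# $1 = namespace, $2 = Deployment name, $3 = optional timeout (default 5m).
# Hypothetical helper built on `kubectl rollout restart` and `kubectl rollout status`.
restart_and_wait() {
  rw_ns="$1"
  rw_deploy="$2"
  rw_timeout="${3:-5m}"
  kubectl -n "$rw_ns" rollout restart "deployment/$rw_deploy" &&
    kubectl -n "$rw_ns" rollout status "deployment/$rw_deploy" --timeout="$rw_timeout"
}

# Example usage (namespaces are placeholders, Deployment names are assumptions):
#   restart_and_wait <kapp-controller-namespace> kapp-controller
#   restart_and_wait <tkr-webhook-namespace> tkr-conversion-webhook-manager
```

Waiting on the rollout status makes it easier to tell whether tanzu-addons-controller-manager recovers only after both pods are Ready again.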

Affected product area (please put an X in all that apply)

  • ( ) APIs
  • (X) Addons
  • ( ) CLI
  • ( ) Docs
  • ( ) IAM
  • ( ) Installation
  • ( ) Plugin
  • ( ) Security
  • (X) Test and Release
  • ( ) User Experience
  • ( ) Developer Experience

Expected behavior

tanzu management-cluster create/delete succeeds eventually.

Steps to reproduce the bug

The issue is intermittent, so no specific steps are available to reproduce the issue.

Version (include the SHA if the version is not obvious)
kapp-controller: v0.41.2

Environment where the bug was observed (cloud, OS, etc)
VMC on Nitros

Relevant Debug Output (Logs, manifests, etc)

  1. kubectl get packagerepositories.packaging.carvel.dev utkg-packages-repo -n vmware-system-pkgs -o yaml | less
usefulErrorMessage: |-
I0316 20:55:53.701753   38347 request.go:601] Waited for 1.031315546s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/addons.cluster.x-k8s.io/v1alpha3
I0316 20:56:04.901721   38347 request.go:601] Waited for 1.020640753s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/netoperator.vmware.com/v1alpha1
I0316 20:56:16.368738   38347 request.go:601] Waited for 1.023073347s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/infrastructure.cluster.vmware.com/v1beta1
I0316 20:56:27.834792   38347 request.go:601] Waited for 1.019206685s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/infrastructure.cluster.vmware.com/v1alpha3
I0316 20:56:39.302278   38347 request.go:601] Waited for 1.005437717s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/crd.projectcalico.org/v1
  kapp: Error: unable to retrieve the complete list of server APIs: data.packaging.carvel.dev/v1alpha1: the server is currently unable to handle the request (possibly related issue: https://github.com/vmware-tanzu/carvel-kapp/issues/12)
  2. addons controller
I0316 21:29:23.950689       1 logr.go:252] clusterbootstrap-resource "msg"="validate create"  "name"="v1.23.15---vmware.1-tkg.4"
W0316 21:29:29.030873       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.23.5/tools/cache/reflector.go:167: failed to list *v1alpha1.TanzuKubernetesRelease: conversion webhook for run.tanzu.vmware.com/v1alpha3, Kind=TanzuKubernetesRelease failed: Post "https://tkr-conversion-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s": dial tcp 172.24.252.7:443: connect: connection refused
E0316 21:29:29.030907       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.23.5/tools/cache/reflector.go:167: Failed to watch *v1alpha1.TanzuKubernetesRelease: failed to list *v1alpha1.TanzuKubernetesRelease: conversion webhook for run.tanzu.vmware.com/v1alpha3, Kind=TanzuKubernetesRelease failed: Post "https://tkr-conversion-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s": dial tcp 172.24.252.7:443: connect: connection refused
  3. tkr-conversion webhook
2023/03/14 20:06:50 http: TLS handshake error from 10.73.129.217:57968: EOF
2023/03/14 20:06:50 http: TLS handshake error from 10.73.129.216:53886: EOF
2023/03/14 20:08:09 http: TLS handshake error from 10.73.129.217:41790: EOF
2023/03/14 20:08:09 http: TLS handshake error from 10.73.129.216:35878: EOF
  4. The packages resource type is missing from the environment:
kubectl get packages
error: the server doesn't have a resource type "packages"
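
Both the kapp error and the missing "packages" resource type are consistent with the aggregated APIService for data.packaging.carvel.dev being unavailable. A quick way to check is to list APIServices whose AVAILABLE column is not True. The sketch below is an illustration rather than output from the affected environment: `unavailable_apiservices` is a hypothetical helper, and it assumes the default table layout of `kubectl get apiservices` (NAME, SERVICE, AVAILABLE, AGE).

```shell
# Reads `kubectl get apiservices` table output on stdin and prints the name of
# every APIService whose AVAILABLE column is anything other than "True".
# Hypothetical helper; it skips the header row and only parses text.
unavailable_apiservices() {
  awk 'NR > 1 && $3 != "True" { print $1 }'
}

# Example usage:
#   kubectl get apiservices | unavailable_apiservices
# APIService objects are named <version>.<group>, so the one to inspect here is:
#   kubectl describe apiservice v1alpha1.data.packaging.carvel.dev
```

The describe output normally includes the condition reason and message, which may help distinguish a kapp-controller pod that is not ready from a networking problem between the API server and the pod.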

Metadata

Assignees

No one assigned

    Labels

    area/addons
    area/lcm (Related to Cluster Lifecycle management)
    kind/bug (PR/Issue related to a bug)
