ci: retry backup manager downloads with shell loop #6957

Open
tennix wants to merge 3 commits into pingcap:release-1.x from tennix:ci/backup-manager-download-retry-release-1.x

tennix commented Jun 22, 2026

Member

Summary

  • retry backup-manager rclone downloads with an explicit shell loop
  • retry backup-manager shush downloads with the same loop
  • keep wget timeout/retry-connrefused options, but retry any non-zero wget exit (including SSL connection failures)

Motivation

PR #6919 e2e jobs failed while building images/tidb-backup-manager/Dockerfile.e2e:

Unable to establish SSL connection.
ERROR: failed to build: failed to solve: process "... wget ... rclone-v1.71.2-linux-amd64.zip ..." did not complete successfully: exit code: 4

wget --tries did not cover this failure mode reliably in Prow, so this change wraps the download in an explicit retry loop.
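
The explicit retry loop described above could look roughly like the following POSIX sh sketch. The `retry` helper, its argument order, and the `RCLONE_URL` variable are illustrative assumptions, not the code in this PR. The idea is that any non-zero exit from the wrapped command (such as wget's exit code 4, which wget documents as a network failure) triggers another attempt, instead of relying on wget's own `--tries` handling.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): run a command up to
# $1 times, sleeping $2 seconds between attempts.
# Note: POSIX sh has no local variables, so these names are global.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Any zero exit counts as success; any non-zero exit is retried.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example call (RCLONE_URL is a placeholder for the real download URL):
#   retry 5 5 wget --timeout=30 --retry-connrefused -O rclone.zip "$RCLONE_URL"
```

Because the helper returns non-zero once the attempts are exhausted, a Dockerfile `RUN` step that chains it with `&&` would still fail the build when every attempt fails.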

Test Plan

  • git diff --check
  • Extracted the Dockerfile RUN shell snippets and validated them with sh -n
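
A syntax-only check of an extracted snippet with `sh -n` can be sketched as follows. The snippet body and the `TOOL_URL` variable are hypothetical stand-ins, not the actual Dockerfile contents. `sh -n` only parses the script and runs none of its commands, so it catches quoting and loop-structure mistakes but says nothing about whether the download succeeds.

```shell
#!/bin/sh
# Hypothetical stand-in for a RUN body extracted from a Dockerfile.
snippet=$(mktemp)
cat > "$snippet" <<'EOF'
attempt=1
until wget -O /tmp/tool.zip "$TOOL_URL"; do
    [ "$attempt" -lt 5 ] || exit 1
    attempt=$((attempt + 1))
    sleep 5
done
EOF

# -n reads and parses the script without executing any command in it.
if sh -n "$snippet"; then
    syntax_status=ok
else
    syntax_status=broken
fi
rm -f "$snippet"
echo "syntax check: $syntax_status"
```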

A Docker build was not run locally because the Docker Desktop daemon was not running in this environment.

@ti-chi-bot ti-chi-bot Bot requested a review from shonge June 22, 2026 08:56
ti-chi-bot Bot commented Jun 22, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bornchanger for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/S label Jun 22, 2026
@ti-chi-bot ti-chi-bot Bot added size/M and removed size/S labels Jun 22, 2026
tennix commented Jun 22, 2026

Member Author

Follow-up fix pushed in b472a6ec0: the latest Prow failures had moved to tests/images/e2e/Dockerfile, where the Helm download failed with curl: (35) Recv failure: Connection reset by peer.

This commit adds explicit retry loops for the e2e image downloads too:

  • kubectl
  • helm
  • awscli
  • cert-manager.yaml

Validation:

  • git diff --check
  • extracted Dockerfile RUN snippets and checked them with sh -n
  • local Docker build of tests/images/e2e with dummy test binaries succeeded:
    DOCKER_BUILDKIT=1 docker build --no-cache --progress=plain --platform linux/amd64 --build-arg=TARGETARCH=amd64 -t tidb-operator-e2e-retry-test:local tests/images/e2e

I also attempted a local build of images/tidb-backup-manager/Dockerfile.e2e with a dummy backup-manager binary; it progressed through the retry-loop download layer, but the full build timed out locally while pulling large external TiDB images.

tennix commented Jun 22, 2026

Member Author

Another follow-up pushed in b6edaa60a: the latest Prow run showed the explicit retry loop working, but all 5 attempts to download https://get.helm.sh/helm-v3.11.0-linux-amd64.tar.gz failed with curl: (28) SSL connection timeout in the Prow Docker build.

This commit keeps get.helm.sh as the primary source, but adds a fallback mirror for the Helm tarball:

https://mirrors.huaweicloud.com/helm/${HELM_VERSION}/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz
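
A primary-plus-fallback download can be sketched as below, assuming a hypothetical `download_with_fallback` helper and hypothetical `FETCH` and `RETRY_DELAY` variables; none of these names come from the PR. Each URL gets up to 5 attempts before the next source is tried, and the helper fails only when every source is exhausted.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): try each URL in order,
# up to 5 attempts each, and stop at the first successful download.
# FETCH is overridable so the control flow can be exercised offline.
FETCH=${FETCH:-"curl --fail --silent --show-error --location --connect-timeout 30 --output"}

download_with_fallback() {
    dest=$1
    shift
    for url in "$@"; do
        attempt=1
        while [ "$attempt" -le 5 ]; do
            # $FETCH is intentionally unquoted so its options split into words.
            if $FETCH "$dest" "$url"; then
                return 0
            fi
            echo "attempt ${attempt} for ${url} failed" >&2
            attempt=$((attempt + 1))
            sleep "${RETRY_DELAY:-5}"
        done
    done
    echo "all sources failed for ${dest}" >&2
    return 1
}

# Example call, with get.helm.sh first and the mirror second:
#   download_with_fallback helm.tar.gz \
#     "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz" \
#     "https://mirrors.huaweicloud.com/helm/${HELM_VERSION}/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz"
```

Listing the primary source first keeps the mirror as a last resort, so it is contacted only after the primary has failed all of its attempts.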

Validation:

  • git diff --check
  • extracted Dockerfile RUN snippets and checked them with sh -n
  • verified the fallback mirror endpoint returns the expected Helm v3.11.0 linux-amd64 tarball headers/content metadata locally

ti-chi-bot Bot commented Jun 22, 2026

Contributor

@tennix: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-e2e-kind-scale-simultaneously | b6edaa6 | link | false | /test pull-e2e-kind-scale-simultaneously |
| pull-e2e-kind-tngm | b6edaa6 | link | false | /test pull-e2e-kind-tngm |
| pull-e2e-kind-serial | b6edaa6 | link | false | /test pull-e2e-kind-serial |
| pull-e2e-kind-basic | b6edaa6 | link | false | /test pull-e2e-kind-basic |
| pull-e2e-kind-dmcluster | b6edaa6 | link | false | /test pull-e2e-kind-dmcluster |
| pull-e2e-kind-br | b6edaa6 | link | false | /test pull-e2e-kind-br |
| pull-e2e-kind-across-kubernetes | b6edaa6 | link | false | /test pull-e2e-kind-across-kubernetes |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
