feat: retry with exponential backoff when Octopus Deploy is unavailable #186

Merged: cstrzadala merged 3 commits into main from feature/retry-on-octopus-unavailable on Mar 31, 2026

@cstrzadala cstrzadala commented Mar 30, 2026

Summary

  • When Octopus Deploy is unavailable (connection refused, 502/503/504, DNS failure, timeouts), the TeamCity plugin now retries CLI execution with exponential backoff instead of failing the build immediately
  • Includes a fix for a pre-existing macOS Zip Slip false-positive in EmbeddedResourceExtractor

Design

  • OctopusErrorClassifier: distinguishes transient errors (retry) from permanent errors (fail fast) by parsing CLI stdout/stderr patterns for both .NET and Go CLIs
  • OctopusCliRetryExecutor: retry loop with exponential backoff (5s initial, 2m max per attempt, 15m total timeout)
  • CaptureWriter: Output.Writer decorator for capturing CLI output for error classification in the legacy build process path
  • Both legacy (OctopusBuildProcess) and new CLI (CLIBuildProcess) paths use the shared retry executor
  • Configurable via environment variables (no UI changes):
    • OCTOPUS_RETRY_ENABLED (default: true)
    • OCTOPUS_RETRY_TIMEOUT (default: 900000 / 15 min)
    • OCTOPUS_RETRY_INITIAL_DELAY (default: 5000 / 5s)
    • OCTOPUS_RETRY_MAX_DELAY (default: 120000 / 2 min)
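
The retry behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual OctopusCliRetryExecutor: the class name `RetrySketch`, its methods, and the way the environment is parsed are all assumptions made for the example. Only the variable names and their defaults come from the list above.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; not the plugin's OctopusCliRetryExecutor.
final class RetrySketch {
    final boolean enabled;
    final long timeoutMs;
    final long initialDelayMs;
    final long maxDelayMs;

    RetrySketch(Map<String, String> env) {
        // Variable names and defaults are the ones listed in the PR description.
        enabled = Boolean.parseBoolean(env.getOrDefault("OCTOPUS_RETRY_ENABLED", "true"));
        timeoutMs = Long.parseLong(env.getOrDefault("OCTOPUS_RETRY_TIMEOUT", "900000"));
        initialDelayMs = Long.parseLong(env.getOrDefault("OCTOPUS_RETRY_INITIAL_DELAY", "5000"));
        maxDelayMs = Long.parseLong(env.getOrDefault("OCTOPUS_RETRY_MAX_DELAY", "120000"));
    }

    /** Delay after failed attempt n (1-based): doubles each time, capped at maxDelayMs. */
    long delayAfterAttempt(int attempt) {
        long delay = initialDelayMs;
        for (int i = 1; i < attempt && delay < maxDelayMs; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxDelayMs);
    }

    /**
     * Repeats the attempt until it succeeds, the failure is not transient,
     * or the next delay would run past the total timeout.
     */
    boolean run(BooleanSupplier attempt, BooleanSupplier failureIsTransient) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        for (int n = 1; ; n++) {
            if (attempt.getAsBoolean()) return true;
            if (!enabled || !failureIsTransient.getAsBoolean()) return false;
            long delay = delayAfterAttempt(n);
            if (System.currentTimeMillis() + delay >= deadline) return false;
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

In this sketch the deadline is checked before sleeping, so a retry is never scheduled that could not start inside the window. Whether the real executor makes the same choice is not stated in the PR.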

Local Testing Results

Tested locally with Docker (TeamCity 2023.11.3 + agent) against an unreachable IP (10.255.255.1:8065) to simulate Octopus Deploy being unavailable.

Legacy .NET CLI (OCTOPUS_NEW_CLI not set, 5min timeout)

The .NET CLI (octo.dll v9.1.7) does not retry on its own: each invocation exits after a single failed connection attempt (~76s TCP timeout). The plugin's retry executor then relaunches the CLI process with exponential backoff.

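| Plugin Attempt | Started  | Failed   | CLI Duration | Backoff        |
|----------------|----------|----------|--------------|----------------|
| 1              | 04:23:53 | 04:25:09 | 76s          | 2s             |
| 2              | 04:25:12 | 04:26:28 | 76s          | 5s             |
| 3              | 04:26:33 | 04:27:49 | 76s          | 5s             |
| 4              | 04:27:54 | 04:29:10 | 76s          | 5s             |
| 5              | 04:29:15 | 04:30:31 | 76s          | none (timeout) |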

Total duration: ~6m38s (with 5min timeout override). Error classifier correctly identified Connection refused as transient. Build log showed retry warning messages between attempts.

Go CLI (OCTOPUS_NEW_CLI=true, default 15min timeout)

The Go CLI (octopus binary) has its own built-in exponential backoff retry within the login command. A single CLI process invocation retried login 10 times internally before exiting.

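| Go CLI Internal # | Login Start | Timeout At | TCP Duration | Backoff to Next  |
|-------------------|-------------|------------|--------------|------------------|
| 1                 | 04:58:40    | 04:59:10   | 30s          | 5s               |
| 2                 | 04:59:16    | 04:59:46   | 30s          | 10s              |
| 3                 | 04:59:57    | 05:00:27   | 30s          | 20s              |
| 4                 | 05:00:48    | 05:01:18   | 30s          | 40s              |
| 5                 | 05:01:58    | 05:02:28   | 30s          | 80s              |
| 6                 | 05:03:49    | 05:04:19   | 30s          | 120s             |
| 7                 | 05:06:19    | 05:06:49   | 30s          | 120s             |
| 8                 | 05:08:49    | 05:09:19   | 30s          | 120s             |
| 9                 | 05:11:19    | 05:11:49   | 30s          | 120s             |
| 10                | 05:13:49    | 05:14:19   | 30s          | none (CLI exits) |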

Total duration: ~15m39s. The Go CLI consumed the full 15-minute retry window in a single process invocation. After it exited, the plugin retry executor classified i/o timeout as transient, but the 15-minute retry window had already elapsed, so the build failed.
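
The "Backoff to Next" column is consistent with a delay that starts at 5s, doubles after each failed login, and is capped at 120s. That reading is inferred from the table, not taken from the Go CLI's source. A small sketch (the class and method names are hypothetical) reproduces the column and the approximate total:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reproduces the observed schedule; not Go CLI code.
final class BackoffSchedule {
    /** Delays between attempts: doubling from initial, capped, none after the last attempt. */
    static List<Long> delaysSeconds(long initial, long cap, int attempts) {
        List<Long> delays = new ArrayList<>();
        long delay = initial;
        for (int i = 1; i < attempts; i++) {
            delays.add(Math.min(delay, cap));
            delay = Math.min(delay * 2, cap);
        }
        return delays;
    }

    /** Total wall time: every attempt's duration plus all delays between attempts. */
    static long totalSeconds(List<Long> delays, long perAttemptSeconds, int attempts) {
        long total = perAttemptSeconds * attempts;
        for (long d : delays) {
            total += d;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> delays = delaysSeconds(5, 120, 10);
        System.out.println(delays);                       // prints [5, 10, 20, 40, 80, 120, 120, 120, 120]
        System.out.println(totalSeconds(delays, 30, 10)); // prints 935
    }
}
```

935s is 15m35s, which is within a few seconds of the observed ~15m39s. The timestamps show roughly one extra second between some attempts, which accounts for the difference.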

Key observations

  • Legacy .NET CLI: the plugin retry executor provides all resilience (the CLI itself does not retry; it exits after its first failed connection attempt)
  • Go CLI: resilience comes primarily from the CLI's own built-in retry; the plugin retry adds a second layer if the CLI exhausts its internal retries quickly (e.g. instant connection refused vs slow i/o timeout)
  • Permanent errors (e.g. bad API key, missing project) are correctly classified and fail immediately without retrying on both paths
  • Error detection works for both CLI output formats (.NET stack traces and Go error strings)
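
The transient/permanent split described in these observations could look something like the sketch below. It is an illustration only: the enum, the class name, and the three patterns are assumptions chosen from the error strings mentioned in this PR ("Connection refused", "i/o timeout", 502/503/504), and the plugin's real pattern list is not reproduced here.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; not the plugin's OctopusErrorClassifier.
final class ErrorClassifierSketch {
    enum Kind { TRANSIENT, PERMANENT }

    // Illustrative patterns only, covering one .NET-style and one Go-style message.
    private static final List<Pattern> TRANSIENT_PATTERNS = List.of(
            Pattern.compile("connection refused", Pattern.CASE_INSENSITIVE),
            Pattern.compile("i/o timeout", Pattern.CASE_INSENSITIVE),
            Pattern.compile("\\b50[234]\\b"));

    /** Scans captured CLI output; anything that matches no transient pattern fails fast. */
    static Kind classify(String cliOutput) {
        for (Pattern p : TRANSIENT_PATTERNS) {
            if (p.matcher(cliOutput).find()) {
                return Kind.TRANSIENT;
            }
        }
        return Kind.PERMANENT;
    }
}
```

Defaulting to PERMANENT keeps builds from retrying on errors nobody anticipated, at the cost that an unrecognised transient message is not retried until a pattern is added for it.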

Test plan

  • Unit tests pass (./gradlew test — 84 tests, all green)
  • Spotless formatting passes
  • Manual verification: legacy CLI retries on unreachable server
  • Manual verification: Go CLI retries on unreachable server

cstrzadala and others added 3 commits March 27, 2026 21:52
…ilable

When the Octopus Deploy server is unavailable (connection refused, 502/503/504,
DNS failure, timeouts), the TeamCity plugin now retries CLI execution with
exponential backoff instead of failing the build immediately.

- OctopusErrorClassifier: distinguishes transient errors (retry) from permanent
  errors (fail fast) by parsing CLI stdout/stderr patterns for both .NET and Go CLIs
- OctopusCliRetryExecutor: retry loop with exponential backoff (5s initial,
  2m max per attempt, 15m total timeout), configurable via env vars
  (OCTOPUS_RETRY_ENABLED, OCTOPUS_RETRY_TIMEOUT, OCTOPUS_RETRY_INITIAL_DELAY,
  OCTOPUS_RETRY_MAX_DELAY)
- CaptureWriter: Output.Writer decorator for capturing CLI output for error
  classification in the legacy build process path
- Both legacy (OctopusBuildProcess) and new CLI (CLIBuildProcess) paths use the
  shared retry executor

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The extractTarGzResource() Zip Slip security check compared
destDir.resolve(entry).normalize() against destDir.toRealPath(). On macOS,
/var is a symlink to /private/var, so toRealPath() resolved to
/private/var/folders/... while the entry path kept /var/folders/...,
causing startsWith() to fail and throw a false-positive Zip Slip error.

Fix: resolve destDir to its real path once upfront, then use the resolved
path consistently for both entry resolution and the security check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
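
The fix described in the commit above can be sketched as follows. This is a minimal illustration assuming a standalone helper; `ZipSlipSketch` and `safeResolve` are hypothetical names, not the code in EmbeddedResourceExtractor. The destination directory must already exist, because `Path.toRealPath()` fails for missing paths.

```java
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical sketch of the "resolve once, compare consistently" approach.
final class ZipSlipSketch {
    /** Resolves entryName under destDir and rejects entries that escape it. */
    static Path safeResolve(Path destDir, String entryName) throws IOException {
        // Resolve symlinks (e.g. /var -> /private/var on macOS) once, up front.
        Path realDest = destDir.toRealPath();
        // Build the entry path from the same resolved base, then collapse any "..".
        Path target = realDest.resolve(entryName).normalize();
        if (!target.startsWith(realDest)) {
            throw new IOException("Zip Slip: entry escapes destination: " + entryName);
        }
        return target;
    }
}
```

Because both sides of `startsWith()` derive from the same resolved base, a symlinked temp directory no longer produces a mismatch, while an entry such as `../evil` is still rejected.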
Adds Javadoc and inline comments to the retry infrastructure classes
(OctopusCliRetryExecutor, OctopusErrorClassifier, CaptureWriter) and
both build process integration points (OctopusBuildProcess, CLIBuildProcess)
to help engineers unfamiliar with the codebase understand the retry flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cstrzadala cstrzadala marked this pull request as ready for review March 31, 2026 05:22
@chen-keinan chen-keinan left a comment

lgtm

@cstrzadala cstrzadala merged commit b643dc1 into main Mar 31, 2026
4 checks passed
@cstrzadala cstrzadala deleted the feature/retry-on-octopus-unavailable branch March 31, 2026 22:19
matt-richardson added a commit that referenced this pull request May 19, 2026
The retry executor added in #186 was not retrying when Octopus Cloud
was in its maintenance window: octo.dll v9.1.7 surfaces the maintenance
HTML page wrapped in OctopusServerException, with an outer message of
"Unable to connect to the Octopus Deploy server" (not "remote server"),
and the underlying HTTP 503 is never plain-text in the CLI output.
OctopusErrorClassifier saw none of its TRANSIENT_PATTERNS match and fell
through to PERMANENT, so retry was skipped and the build failed on the
first attempt.

- broaden the connect pattern to "Unable to connect to .*server" so it
  matches octo.dll's own wrapper wording as well as the System.Net one
- add an "undergoing maintenance" pattern that catches the maintenance
  HTML body (title and copy both contain this phrase)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
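
The effect of the two patterns in the follow-up commit can be checked with a short sketch. The regex strings are the ones quoted in the commit message; the class name, the use of `find()`, and the absence of flags are assumptions for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the pattern strings come from the commit message above.
final class BroadenedPatterns {
    // Matches both the System.Net wording ("the remote server") and octo.dll's wrapper wording.
    static final Pattern CONNECT = Pattern.compile("Unable to connect to .*server");
    // Matches the phrase the commit says appears in the maintenance page's title and copy.
    static final Pattern MAINTENANCE = Pattern.compile("undergoing maintenance");

    static boolean isTransient(String cliOutput) {
        // find() searches anywhere in the output; "." does not cross line breaks by default.
        return CONNECT.matcher(cliOutput).find() || MAINTENANCE.matcher(cliOutput).find();
    }
}
```

With these patterns, both "Unable to connect to the remote server" and "Unable to connect to the Octopus Deploy server" are treated as transient, which is the gap the commit describes.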