
fix(cli): stop managed EC2 demo starting zero containers (#298) #302

Merged
kylehounslow merged 3 commits into opensearch-project:main from kylehounslow:fix/managed-zero-containers
Jun 17, 2026

@kylehounslow kylehounslow commented Jun 17, 2026

Collaborator

What

Fixes #298. A managed-mode npx @opensearch-project/observability-stack deploy provisioned AWS resources fine but started zero containers. docker compose up aborted at project validation with service "example-agent-eval-canary" depends on undefined service "opensearch": invalid compose project, so no telemetry ever reached OpenSearch.

Root cause

Two independent design choices combined into a version-skew bug.

  1. Managed mode defines no local backend. The generated docker-compose.managed.yml includes docker-compose.examples.yml plus docker-compose.otel-demo.yml but defines only otel-collector. Telemetry ships to the remote OSIS pipeline via SigV4, so there is no local opensearch/prometheus.
  2. The EC2 user-data cloned the stack at main HEAD with no pinned ref. A "pinned" CLI release still deployed whatever main looked like the day the instance booted.

Two services in the included files (example-agent-eval-canary, otel-demo-alerting-rules-monitors-init) declare a depends_on on a local backend. They were added to main after the CLI was tagged (cli-installer-v0.1.1, 2026-04-09), so the same CLI binary that worked in April started failing once those services landed. When a depends_on target isn't defined, Compose rejects the whole project.

Both services hardcode a local backend with no managed path (https://opensearch:9200 basic-auth; http://prometheus:9090 Cortex), so they don't belong in managed mode regardless.

Fix

  • Gate both services behind a local-backends compose profile. Compose prunes a profiled-out service and its depends_on edges, so managed mode validates clean while the local path is unchanged.
  • Default COMPOSE_PROFILES=local-backends in committed .env so a raw docker compose up keeps these services on. Managed user-data exports an empty COMPOSE_PROFILES to prune them.
  • Pin the EC2 stack clone to cli-installer-v<version> (override via OBS_STACK_REF), falling back to the default branch if the ref is absent.
  • Bump the CLI to 0.1.2 so the pin resolves to a tag that contains the gate (see release dependency below).
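
To make the gate concrete, here is a toy model of the behaviour described above. It is a sketch for illustration only, not Compose's implementation: the function names are hypothetical, and the service maps are trimmed-down stand-ins for the real compose files.

```javascript
// Toy model of profile gating. All names here are hypothetical.

// COMPOSE_PROFILES is a comma-separated list; an empty value activates no profiles.
function parseProfiles(value) {
  return (value ?? "")
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// A service without `profiles` is always active; otherwise at least one
// of its profiles must be in the active set.
function activeServices(services, activeProfiles) {
  const active = {};
  for (const [name, svc] of Object.entries(services)) {
    const profiles = svc.profiles ?? [];
    if (profiles.length === 0 || profiles.some((p) => activeProfiles.includes(p))) {
      active[name] = svc;
    }
  }
  return active;
}

// Returns one message per depends_on edge whose target is missing from the project.
function danglingDependencies(services) {
  const errors = [];
  for (const [name, svc] of Object.entries(services)) {
    for (const dep of svc.depends_on ?? []) {
      if (!Object.hasOwn(services, dep)) {
        errors.push(`service "${name}" depends on undefined service "${dep}"`);
      }
    }
  }
  return errors;
}

// Managed project: otel-collector plus an included example service, no local backend.
const ungated = {
  "otel-collector": {},
  "example-agent-eval-canary": { depends_on: ["opensearch"] },
};
const gated = {
  "otel-collector": {},
  "example-agent-eval-canary": { depends_on: ["opensearch"], profiles: ["local-backends"] },
};

const managedProfiles = parseProfiles(""); // managed user-data exports an empty value
console.log(danglingDependencies(activeServices(ungated, managedProfiles)));
console.log(danglingDependencies(activeServices(gated, managedProfiles)));
```

In this model the un-gated project reports the dangling `opensearch` edge, while the gated project with empty profiles reports nothing because the service and its edge are dropped together.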

Release dependency (important)

The profile gate is the actual fix; the clone pin is hardening. But the two interact through the CLI version, and the ordering matters:

  • The published CLI pins the clone to cli-installer-v${package.json version}.
  • At 0.1.1, that tag predates both the offending services and this gate. A 0.1.1 deploy therefore avoids the crash only by cloning an April-vintage stack: the gate never runs, and users silently get a months-old stack.
  • The gate only executes once the CLI clones a tag built from main with this PR.

This PR bumps package.json to 0.1.2 so the path is: merge, then a maintainer pushes the cli-installer-v0.1.2 tag (gated behind the release-approval environment), which publishes to npm. A 0.1.2 deploy then clones cli-installer-v0.1.2, which has the gate. Until that tag is cut, the fix is not live for npx users.
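
The ref selection can be sketched roughly as follows. This is a hypothetical helper, not the code in the PR: `resolveStackRef` and `refExists` are invented names, and it assumes the fallback to the default branch applies to an `OBS_STACK_REF` override as well as to the version-derived tag.

```javascript
// Hypothetical sketch of the clone-ref selection described above.
// `refExists` stands in for whatever existence check the clone step performs;
// it is injected so the sketch stays a pure function.
function resolveStackRef({ cliVersion, env = {}, refExists }) {
  const override = env.OBS_STACK_REF;
  const wanted =
    override && override.length > 0 ? override : `cli-installer-v${cliVersion}`;
  // null means "clone the default branch".
  return refExists(wanted) ? wanted : null;
}

// Illustrative tag set: only the 0.1.1 tag exists.
const knownTags = new Set(["cli-installer-v0.1.1"]);
const tagExists = (ref) => knownTags.has(ref);

// Resolves to its own tag, which predates the gate.
console.log(resolveStackRef({ cliVersion: "0.1.1", refExists: tagExists }));
// Tag absent: falls back to the default branch (null).
console.log(resolveStackRef({ cliVersion: "0.1.2", refExists: tagExists }));
```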

Tradeoff worth noting

Pinning the clone means stack fixes won't reach already-pinned CLI deploys until a new CLI tag ships. That's correctness over freshness, the right call for a reproducible demo installer, but worth stating.

Validation

  • Managed mode (COMPOSE_PROFILES= empty): docker compose -f docker-compose.managed.yml config resolves 30 services, exit 0, both offenders pruned.
  • Local mode via root docker-compose.yml with profile active: 40 services, both offenders and both backends present.
  • Local mode with profile unset: offenders pruned (confirms the gate).
  • Reproduced the original failure and the fix on a real docker compose run: un-gated → invalid compose project (zero containers); gated with empty profiles → otel-collector starts (Everything is ready, GRPC 4317 / HTTP 4318).
  • CLI unit tests: 29/29 pass, including a new managed-mode test that runs docker compose config on the generated docker-compose.managed.yml and asserts the offenders are pruned.
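
For readers who want to repeat the pruning check by hand, a minimal sketch of the assertion follows. It assumes the service list comes from `docker compose -f docker-compose.managed.yml config --services` run with `COMPOSE_PROFILES` empty; the helper name and the sample output are illustrative, and the actual test in this PR may be structured differently.

```javascript
// Hypothetical helper: given the output of `docker compose config --services`
// (one service name per line), return the offenders that were NOT pruned.
function unprunedServices(servicesOutput, offenders) {
  const resolved = new Set(
    servicesOutput
      .split(/\r?\n/)
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
  );
  return offenders.filter((name) => resolved.has(name));
}

const OFFENDERS = [
  "example-agent-eval-canary",
  "otel-demo-alerting-rules-monitors-init",
];

// Illustrative output, not a real capture.
const sampleOutput = "otel-collector\nexample-agent-eval-canary\n";
console.log(unprunedServices(sampleOutput, OFFENDERS));
```

An empty result means both backend-coupled services were pruned; any names returned are services the gate failed to remove.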

Why CI didn't catch the original bug

Both e2e jobs exercise the local path only. e2e-compose runs root-compose where opensearch/prometheus exist, so the depends_on edges resolve. e2e-install runs install.sh with the local target and declines examples/otel-demo, so the offending files aren't even included, and the managed branch (ec2-demo.mjs) is never entered. Nothing validated the generated managed compose project. The new CLI test closes that gap at the config level.

A full managed npx deploy against live AWS/OSIS has not been run for this branch.

Follow-up (separate PR, not blocking this one)

OSIS data-plane warm-up: control-plane Status=ACTIVE does not mean the ingest endpoint accepts traffic. For several minutes afterward, the collector logs a wall of failed to make an HTTP request: ...: EOF on every export and OpenSearch stays empty, so a fresh deploy reads as broken even though it self-heals. Proposed fix: after ACTIVE, poll the ingest endpoint until it returns any HTTP response (401 to an unsigned probe means the data plane is up), time-boxed, warn and continue. Tracked separately.
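
One possible shape for that poll, as a sketch only since the follow-up PR is not written yet. Everything here is hypothetical: `waitForDataPlane`, its options, and the `probe` contract (resolve with a status code on any HTTP response, reject on a transport error such as EOF) are assumptions made for illustration.

```javascript
// Sketch of a time-boxed warm-up wait. `probe`, `sleep`, and `now` are
// injected so the loop can be exercised without a network or a real clock.
async function waitForDataPlane({ probe, timeoutMs, intervalMs, sleep, now = Date.now }) {
  const deadline = now() + timeoutMs;
  for (;;) {
    try {
      const status = await probe();
      // Any HTTP response, even 401 to an unsigned probe, means the data plane is up.
      return { ready: true, status };
    } catch (err) {
      // Transport error: the endpoint is not accepting traffic yet.
    }
    if (now() + intervalMs > deadline) {
      console.warn("ingest endpoint not responding yet; continuing anyway");
      return { ready: false, status: null };
    }
    await sleep(intervalMs);
  }
}

// Usage with a fake probe that fails twice, then answers 401.
let demoCalls = 0;
const demoProbe = async () => {
  demoCalls += 1;
  if (demoCalls < 3) throw new Error("EOF");
  return 401;
};
waitForDataPlane({
  probe: demoProbe,
  timeoutMs: 1000,
  intervalMs: 1,
  sleep: (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
}).then((result) => console.log(result));
```

Returning a result instead of throwing on timeout matches the "warn and continue" intent: the deploy proceeds and the collector's own retries cover the remaining warm-up.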

TODO before marking ready for review

  • After merge, cut the cli-installer-v0.1.2 tag so the published CLI clones a stack containing the gate. Without it, the fix is committed but not live for npx users.

Managed mode wrote a docker-compose.managed.yml that includes
docker-compose.examples.yml and docker-compose.otel-demo.yml but defines
no local opensearch/prometheus (telemetry ships to OSIS via SigV4). Two
services in those files depend on a local backend, so Compose rejected
the whole project at validation and started nothing.

- Gate example-agent-eval-canary and otel-demo-alerting-rules-monitors-init
  behind a "local-backends" compose profile. Compose prunes a profiled-out
  service and its depends_on edges, so managed mode validates clean while
  the local path is unchanged.
- Default COMPOSE_PROFILES=local-backends in committed .env so raw
  `docker compose up` keeps these services; managed user-data exports an
  empty COMPOSE_PROFILES to prune them.
- Pin the EC2 stack clone to cli-installer-v<version> (override via
  OBS_STACK_REF), falling back to the default branch if the ref is absent,
  so a pinned CLI no longer deploys whatever main HEAD looks like at boot.

Validated: managed `compose config` resolves 30 services with both
offenders pruned; local root-compose run includes both offenders and the
backends. 28/28 CLI unit tests pass.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>

- Add a managed-mode regression test that extracts the docker-compose.managed.yml
  heredoc from buildUserData output and runs `docker compose config`, asserting
  the project resolves with COMPOSE_PROFILES empty and the two backend-coupled
  services are pruned. Skips when docker compose is unavailable.
- Bump package.json to 0.1.2 so the published CLI pins the EC2 stack clone to a
  tag that contains the profile gate. At 0.1.1 the pin resolves to a tag predating
  both the offending services and this fix, so the gate would never execute.
- Dedupe the profile rationale into the .env comment; reduce per-service comments
  to a one-line pointer. Trim PR-narrative phrasing from the user-data comment.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>

Per review: remove narrative tails and rationale duplicated across files.
- .env keeps the canonical "why" for the profile; other comments point to it.
- clone-block comment references STACK_REF instead of restating the drift rationale.
- trim effect-restating tails from the COMPOSE_PROFILES export and test header.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
@kylehounslow kylehounslow marked this pull request as ready for review June 17, 2026 19:24

@joshuali925 joshuali925 left a comment

Member

thanks

@kylehounslow kylehounslow merged commit fa03540 into opensearch-project:main Jun 17, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

Managed-mode EC2 demo starts zero containers: "depends on undefined service opensearch"

2 participants