fix(cli): stop managed EC2 demo starting zero containers (#298) #302
Merged
kylehounslow merged 3 commits on Jun 17, 2026
Managed mode wrote a docker-compose.managed.yml that includes docker-compose.examples.yml and docker-compose.otel-demo.yml but defines no local opensearch/prometheus (telemetry ships to OSIS via SigV4). Two services in those files depend on a local backend, so Compose rejected the whole project at validation and started nothing.

- Gate example-agent-eval-canary and otel-demo-alerting-rules-monitors-init behind a "local-backends" compose profile. Compose prunes a profiled-out service and its depends_on edges, so managed mode validates clean while the local path is unchanged.
- Default COMPOSE_PROFILES=local-backends in committed .env so raw `docker compose up` keeps these services; managed user-data exports an empty COMPOSE_PROFILES to prune them.
- Pin the EC2 stack clone to cli-installer-v<version> (override via OBS_STACK_REF), falling back to the default branch if the ref is absent, so a pinned CLI no longer deploys whatever main HEAD looks like at boot.

Validated: managed `compose config` resolves 30 services with both offenders pruned; local root-compose run includes both offenders and the backends. 28/28 CLI unit tests pass.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>

- Add a managed-mode regression test that extracts the docker-compose.managed.yml heredoc from buildUserData output and runs `docker compose config`, asserting the project resolves with COMPOSE_PROFILES empty and the two backend-coupled services are pruned. Skips when docker compose is unavailable.
- Bump package.json to 0.1.2 so the published CLI pins the EC2 stack clone to a tag that contains the profile gate. At 0.1.1 the pin resolves to a tag predating both the offending services and this fix, so the gate would never execute.
- Dedupe the profile rationale into the .env comment; reduce per-service comments to a one-line pointer. Trim PR-narrative phrasing from the user-data comment.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>

Per review: remove narrative tails and rationale duplicated across files.

- .env keeps the canonical "why" for the profile; other comments point to it.
- clone-block comment references STACK_REF instead of restating the drift rationale.
- trim effect-restating tails from the COMPOSE_PROFILES export and test header.

Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
What
Fixes #298. A managed-mode `npx @opensearch-project/observability-stack` deploy provisioned AWS resources fine but started zero containers. `docker compose up` aborted at project validation with `service "example-agent-eval-canary" depends on undefined service "opensearch": invalid compose project`, so no telemetry ever reached OpenSearch.

Root cause
Two independent design choices combined into a version-skew bug.

- `docker-compose.managed.yml` includes `docker-compose.examples.yml` plus `docker-compose.otel-demo.yml` but defines only `otel-collector`. Telemetry ships to the remote OSIS pipeline via SigV4, so there is no local `opensearch`/`prometheus`.
- The EC2 stack clone tracked `main` HEAD with no pinned ref. A "pinned" CLI release still deployed whatever `main` looked like the day the instance booted.

Two services in the included files (`example-agent-eval-canary`, `otel-demo-alerting-rules-monitors-init`) `depends_on` a local backend. They were added to `main` after the CLI was tagged (`cli-installer-v0.1.1`, 2026-04-09), so the same CLI binary that worked in April started failing once those services landed. When a `depends_on` target isn't defined, Compose rejects the whole project.

Both services hardcode a local backend with no managed path (`https://opensearch:9200` basic-auth; `http://prometheus:9090` Cortex), so they don't belong in managed mode regardless.

Fix
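The fix relies on profile pruning happening before `depends_on` validation. A minimal sketch of that behavior follows, as a simplified model rather than Compose's actual implementation. `resolveProject` and the two sample projects are hypothetical and are not code from this PR; only the service names, the profile name, and the error text come from the description.

```javascript
// Simplified model of profile pruning followed by depends_on validation.
// NOT Docker Compose's implementation; resolveProject is a hypothetical helper.
function resolveProject(services, activeProfiles) {
  const active = new Set(activeProfiles);
  const kept = {};
  for (const [name, svc] of Object.entries(services)) {
    const profiles = svc.profiles ?? [];
    // A service with no profiles is always enabled; a profiled service
    // needs at least one of its profiles to be active.
    if (profiles.length === 0 || profiles.some((p) => active.has(p))) {
      kept[name] = svc;
    }
  }
  // Only services that survived pruning have their depends_on edges checked.
  for (const [name, svc] of Object.entries(kept)) {
    for (const dep of svc.depends_on ?? []) {
      if (!(dep in services)) {
        throw new Error(
          `service "${name}" depends on undefined service "${dep}": invalid compose project`
        );
      }
    }
  }
  return Object.keys(kept);
}

// Illustrative managed projects: no local "opensearch" service is defined.
const managedUngated = {
  "otel-collector": {},
  "example-agent-eval-canary": { depends_on: ["opensearch"] },
};
const managedGated = {
  "otel-collector": {},
  "example-agent-eval-canary": {
    profiles: ["local-backends"],
    depends_on: ["opensearch"],
  },
};
```

In this model, `resolveProject(managedGated, [])` keeps only `otel-collector`, while `resolveProject(managedUngated, [])` throws the validation error quoted above. Passing `["local-backends"]` to the gated project throws as well, because the managed project still defines no `opensearch` service, which is why managed user-data exports an empty `COMPOSE_PROFILES`.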
- Gate `example-agent-eval-canary` and `otel-demo-alerting-rules-monitors-init` behind a `local-backends` compose profile. Compose prunes a profiled-out service and its `depends_on` edges, so managed mode validates clean while the local path is unchanged.
- Default `COMPOSE_PROFILES=local-backends` in committed `.env` so a raw `docker compose up` keeps these services on. Managed user-data exports an empty `COMPOSE_PROFILES` to prune them.
- Pin the EC2 stack clone to `cli-installer-v<version>` (override via `OBS_STACK_REF`), falling back to the default branch if the ref is absent.
- Bump `package.json` to `0.1.2` so the pin resolves to a tag that contains the gate (see release dependency below).

Release dependency (important)
The profile gate is the actual fix; the clone pin is hardening. But the two interact through the CLI version, and the ordering matters:
- The clone pins to `cli-installer-v${package.json version}`.
- At `0.1.1`, that tag predates both the offending services and this gate. So a `0.1.1` deploy avoids the crash only by cloning an April-vintage stack, the gate never runs, and users silently get a months-old stack.
- The gate reaches `main` with this PR.

This PR bumps `package.json` to `0.1.2` so the path is: merge, then a maintainer pushes the `cli-installer-v0.1.2` tag (gated behind the `release-approval` environment), which publishes to npm. A `0.1.2` deploy then clones `cli-installer-v0.1.2`, which has the gate. Until that tag is cut, the fix is not live for `npx` users.

Tradeoff worth noting
Pinning the clone means stack fixes won't reach already-pinned CLI deploys until a new CLI tag ships. That's correctness over freshness, the right call for a reproducible demo installer, but worth stating.
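For reference, the pin-and-fallback order can be sketched as below. This is an illustration under stated assumptions, not the code in `ec2-demo.mjs`. `resolveStackRef` and the injected `refExists` predicate are hypothetical, and the sketch assumes the fallback to the default branch also applies when `OBS_STACK_REF` names a ref that does not exist.

```javascript
// Hypothetical helper: picks the git ref the EC2 instance should clone.
// refExists is an injected predicate (for example, one backed by `git ls-remote`).
function resolveStackRef({ version, env = {}, refExists }) {
  // An explicit OBS_STACK_REF override wins over the version-derived tag.
  const wanted = env.OBS_STACK_REF || `cli-installer-v${version}`;
  // null stands for "clone the default branch" (assumed fallback behavior).
  return refExists(wanted) ? wanted : null;
}

// Illustrative set of tags present on the remote.
const knownTags = new Set(["cli-installer-v0.1.1", "cli-installer-v0.1.2"]);
const tagExists = (ref) => knownTags.has(ref);

const pinned = resolveStackRef({ version: "0.1.2", refExists: tagExists });
const fallback = resolveStackRef({ version: "0.1.3", refExists: tagExists });
const overridden = resolveStackRef({
  version: "0.1.2",
  env: { OBS_STACK_REF: "cli-installer-v0.1.1" },
  refExists: tagExists,
});
```

Here `pinned` is `"cli-installer-v0.1.2"`, `fallback` is `null` because no `cli-installer-v0.1.3` tag exists in the sample set, and `overridden` is `"cli-installer-v0.1.1"`. Injecting `refExists` keeps the ordering logic testable without network access.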
Validation
- Managed mode (`COMPOSE_PROFILES=` empty): `docker compose -f docker-compose.managed.yml config` resolves 30 services, exit 0, both offenders pruned.
- Root `docker-compose.yml` with profile active: 40 services, both offenders and both backends present.
- `docker compose` run: un-gated → `invalid compose project` (zero containers); gated with empty profiles → `otel-collector` starts (`Everything is ready`, GRPC 4317 / HTTP 4318).
- New CLI regression test runs `docker compose config` on the generated `docker-compose.managed.yml` and asserts the offenders are pruned.

Why CI didn't catch the original bug
Both e2e jobs exercise the local path only. `e2e-compose` runs root-compose where `opensearch`/`prometheus` exist, so the `depends_on` edges resolve. `e2e-install` runs `install.sh` with the local target and declines examples/otel-demo, so the offending files aren't even included, and the managed branch (`ec2-demo.mjs`) is never entered. Nothing validated the generated managed compose project. The new CLI test closes that gap at the `config` level.

A full managed `npx` deploy against live AWS/OSIS has not been run for this branch.

Follow-up (separate PR, not blocking this one)
OSIS data-plane warm-up: control-plane `Status=ACTIVE` does not mean the ingest endpoint accepts traffic. For several minutes after, the collector logs a wall of `failed to make an HTTP request: ...: EOF` on every export and OpenSearch stays empty, so a fresh deploy reads as broken even though it self-heals. Proposed fix: after `ACTIVE`, poll the ingest endpoint until it returns any HTTP response (401 to an unsigned probe means the data plane is up), time-boxed, warn and continue. Tracked separately.

TODO before marking ready for review
- Push the `cli-installer-v0.1.2` tag so the published CLI clones a stack containing the gate. Without it, the fix is committed but not live for `npx` users.