Commit 5f8af30

Merge branch 'fix/npx-upgrade-stale-compose' into 'main'
fix(cli): refresh stale docker-compose.yml on non-git npx upgrade (VM_AUTH wiring) — GA blocker #186

Closes #186

See merge request postgres-ai/postgresai!283
2 parents ed9979b + 961f13b commit 5f8af30

3 files changed

Lines changed: 756 additions & 6 deletions

README.md

Lines changed: 3 additions & 2 deletions
@@ -281,7 +281,7 @@ postgresai mon stop

 ### Step 3: Pull new Docker images and restart

-`mon update` migrates `.env` (adds any newly-required keys) and pulls images while preserving your user-managed `instances.yml`. It does **not** change `PGAI_TAG`, so set the new image tag yourself first — otherwise `mon update` just re-pulls and restarts the *old* version:
+`mon update` migrates `.env` (adds any newly-required keys), refreshes `docker-compose.yml` to the new stack version, and pulls images — all while preserving your user-managed `instances.yml`. It does **not** change `PGAI_TAG`, so set the new image tag yourself first — otherwise `mon update` just re-pulls and restarts the *old* version:

 ```bash
 # In your monitoring directory (typically ~/.postgres_ai/), edit .env and set
@@ -293,12 +293,13 @@ postgresai mon start

 This will:
 - Add any newly required `.env` keys for the newer stack (existing values, your secrets, and `instances.yml` targets are preserved)
+- Refresh `docker-compose.yml` to match the new stack version on non-git installs (e.g. npx / `npm install -g`), backing up the old file as `docker-compose.yml.bak-<oldtag>-<hash>` (the original is never overwritten on repeated runs) — this is what wires newly-required service config such as `VM_AUTH_*` on `sink-prometheus`. The fetched compose is validated before it replaces your working one, so a network proxy/login page can never clobber it. Git checkouts already get this via `git pull`, so it is skipped for them.
 - Pull the Docker images for the `PGAI_TAG` you set
 - Start the services on the new images

 > **Note:** The `.env` file contains configuration for the monitoring stack, including `PGAI_TAG` (the Docker image version tag), `REPLICATOR_PASSWORD` (generated password for the demo standby replication user), `VM_AUTH_USERNAME`, `VM_AUTH_PASSWORD`, and optionally `GF_SECURITY_ADMIN_PASSWORD` (Grafana admin password) and `PGAI_REGISTRY` (custom Docker registry). `postgresai mon local-install` preserves existing `REPLICATOR_PASSWORD` and `VM_AUTH_*` values or generates new ones when they are missing; Docker Compose requires these values and does not use known default passwords.

-> **In-place upgrade note:** Newer stack versions can require additional `.env` keys (e.g., `VM_AUTH_USERNAME` / `VM_AUTH_PASSWORD` were added in 0.15 for VictoriaMetrics basic auth). Both `postgresai mon local-install -y` and `postgresai mon update` perform a purely-additive `.env` migration on every run: existing values are preserved verbatim, and any newly-required keys are appended with safe random defaults. If you run `docker compose` directly and maintain `.env` yourself, add `VM_AUTH_USERNAME=vmauth` and a non-empty `VM_AUTH_PASSWORD` before upgrading, or run `postgresai mon update-config` once to have the CLI fill them in for you. To rotate the VictoriaMetrics auth password, run `VM_AUTH_PASSWORD="$(openssl rand -base64 18)" ./scripts/rotate-vm-auth.sh` from the monitoring directory; the script updates `.env` and recreates `sink-prometheus` plus `grafana` together so datasource provisioning cannot reinsert stale credentials on restart.
+> **In-place upgrade note:** Newer stack versions can require both additional `.env` keys and a matching `docker-compose.yml` (e.g., `VM_AUTH_USERNAME` / `VM_AUTH_PASSWORD` and the `sink-prometheus` auth wiring were added in 0.15 for VictoriaMetrics basic auth). `postgresai mon local-install -y`, `postgresai mon update`, and `postgresai mon update-config` all perform a purely-additive `.env` migration on every run (existing values preserved verbatim; newly-required keys appended with safe random defaults) **and** refresh `docker-compose.yml` to the new stack version on non-git installs (npx / `npm install -g`), backing up the old compose as `docker-compose.yml.bak-<oldtag>-<hash>` (the pristine original is preserved across repeated runs; the fetched compose is validated as a real stack file before it replaces yours). This closes the prior gap where npx upgrades kept a stale 0.14 compose and `sink-prometheus` crashed with `missing "VM_AUTH_USERNAME" env var`. If you run `docker compose` directly and maintain `.env` yourself, add `VM_AUTH_USERNAME=vmauth` and a non-empty `VM_AUTH_PASSWORD` before upgrading, or run `postgresai mon update-config` once to have the CLI fill them in (and refresh the compose) for you. To rotate the VictoriaMetrics auth password, run `VM_AUTH_PASSWORD="$(openssl rand -base64 18)" ./scripts/rotate-vm-auth.sh` from the monitoring directory; the script updates `.env` and recreates `sink-prometheus` plus `grafana` together so datasource provisioning cannot reinsert stale credentials on restart.

 **Alternative: Manual upgrade**
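
Reviewer sketch: the README describes backups named `docker-compose.yml.bak-<oldtag>-<hash>`. The snippet below illustrates how such a name could be derived, assuming (as the CLI diff in this commit indicates) that the hash is the first 8 hex characters of a SHA-256 over the old compose content. `backupNameFor` is a hypothetical helper used only for illustration; it is not part of the CLI.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper (illustration only): build the backup file name from the
// old tag and the OLD compose content.
function backupNameFor(oldTag: string, oldContent: string): string {
  const hash8 = createHash("sha256").update(oldContent).digest("hex").slice(0, 8);
  return `docker-compose.yml.bak-${oldTag}-${hash8}`;
}

const pristine = "services:\n  sink-prometheus: {}\n";
const first = backupNameFor("0.14.0", pristine);
const again = backupNameFor("0.14.0", pristine);
const edited = backupNameFor("0.14.0", pristine + "# local edit\n");

// Same old content gives the same name; different old content gives a new name.
console.log(first, again === first, edited === first);
```

Because the name depends on the old content, re-running the upgrade against an unchanged compose maps to the same backup file, while a compose that was modified in between maps to a different one.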

cli/bin/postgres-ai.ts

Lines changed: 242 additions & 4 deletions
@@ -590,6 +590,208 @@ async function ensureDefaultMonitoringProject(): Promise<PathResolution> {
   return { fs, path, projectDir, composeFile, instancesFile };
 }

+/**
+ * Sanitize a PGAI_TAG value before it is interpolated into the compose backup
+ * filename (`docker-compose.yml.bak-<tag>-<hash8>`). The value flows straight
+ * into a filename via path.resolve, so it is quote-stripped and validated
+ * against a conservative charset: a malformed or hostile tag (e.g.
+ * `../../../tmp/x`, or one carrying a path separator) is rejected to null so it
+ * can never escape projectDir or otherwise poison the backup path. Callers fall
+ * back to a timestamp suffix when this returns null.
+ *
+ * Applied centrally to BOTH tag sources — the .env read (readDeployedTag) and
+ * the OLD tag passed in by callers that rewrite .env first (e.g. local-install)
+ * — so neither path can bypass the validation.
+ */
+function sanitizeTagForBackup(tag: string | null | undefined): string | null {
+  if (tag == null) return null;
+  const stripped = stripMatchingQuotes(tag);
+  return /^[A-Za-z0-9._-]{1,64}$/.test(stripped) ? stripped : null;
+}
+
+/**
+ * Read the deployed PGAI_TAG out of a project's .env (returns null if absent or
+ * if the value fails {@link sanitizeTagForBackup}). Used only to compute the
+ * compose backup file suffix; callers fall back to a timestamp when this is null.
+ */
+function readDeployedTag(projectDir: string): string | null {
+  const envFile = path.resolve(projectDir, ".env");
+  if (!fs.existsSync(envFile)) return null;
+  const m = fs.readFileSync(envFile, "utf8").match(/^PGAI_TAG=(.+)$/m);
+  return m ? sanitizeTagForBackup(m[1]) : null;
+}
+
+/**
+ * Validate that a freshly fetched payload is genuinely a postgres_ai
+ * docker-compose.yml, NOT an HTML login/captcha/proxy/maintenance page that a
+ * 200 response can still carry. `downloadText` only checks `response.ok`, so a
+ * non-compose 200 body would otherwise silently clobber a working compose with
+ * junk — strictly worse than keeping the stale-but-valid one.
+ *
+ * Requires:
+ *  - non-empty, and not an obvious HTML document (<!DOCTYPE / <html ...);
+ *  - parseable YAML resolving to an object;
+ *  - a `services` map that contains the keystone `sink-prometheus` service
+ *    (the very service whose VM_AUTH wiring this refresh exists to deliver).
+ */
+function isValidComposeYaml(text: string): boolean {
+  const trimmed = text.trim();
+  if (!trimmed) return false;
+  // Cheap early reject for HTML error/login/proxy pages served with a 200.
+  if (/^<(?:!doctype|html|\?xml)\b/i.test(trimmed)) return false;
+
+  let doc: unknown;
+  try {
+    doc = yaml.load(trimmed);
+  } catch {
+    return false;
+  }
+  if (!doc || typeof doc !== "object") return false;
+  const services = (doc as { services?: unknown }).services;
+  if (!services || typeof services !== "object") return false;
+  // The keystone service must be present — this is what carries the VM_AUTH_*
+  // wiring that the whole refresh exists to deliver.
+  return Object.prototype.hasOwnProperty.call(services, "sink-prometheus");
+}
+
+/**
+ * Fetch the target docker-compose.yml for a given list of candidate refs.
+ *
+ * Test-only seam (PGAI_COMPOSE_SOURCE): when NODE_ENV === "test" AND the var is
+ * set to a local file path, the compose is read from disk instead of fetched
+ * over the network — keeping the refresh hermetically testable offline without
+ * exposing a local-file→compose injection surface in a normal user environment.
+ * The production path (GitLab raw fetch) is otherwise unchanged.
+ *
+ * Returns null if nothing could be obtained.
+ */
+async function fetchTargetCompose(refs: string[]): Promise<string | null> {
+  const localSource = process.env.PGAI_COMPOSE_SOURCE;
+  if (process.env.NODE_ENV === "test" && localSource && localSource.trim()) {
+    try {
+      return fs.readFileSync(localSource.trim(), "utf8");
+    } catch {
+      return null;
+    }
+  }
+
+  for (const ref of refs) {
+    const url = `https://gitlab.com/postgres-ai/postgres_ai/-/raw/${encodeURIComponent(ref)}/docker-compose.yml`;
+    try {
+      return await downloadText(url);
+    } catch {
+      // try next ref
+    }
+  }
+  return null;
+}
+
+/**
+ * Compare two compose documents ignoring trailing-whitespace-only differences,
+ * so a lone trailing-newline delta doesn't trigger needless churn + a backup.
+ */
+function composeContentEqual(a: string, b: string): boolean {
+  return a.replace(/\s+$/, "") === b.replace(/\s+$/, "");
+}
+
+/**
+ * Refresh the bundled, CLI-owned docker-compose.yml for NON-GIT installs when it
+ * is stale relative to the target stack version (the CLI's own pkg.version).
+ *
+ * Why this exists: docker-compose.yml is a version-coupled static asset. 0.15
+ * added VM basic auth, wiring VM_AUTH_* into the sink-prometheus (VictoriaMetrics)
+ * service and the Grafana datasource. `mon update` already refreshes the compose
+ * for git checkouts via `git pull`, and green-field installs fetch it once via
+ * ensureDefaultMonitoringProject(). But npx / global-npm upgrades are non-git and
+ * only advance PGAI_TAG — leaving the OLD compose in place. That mismatch crashes
+ * sink-prometheus (`missing "VM_AUTH_USERNAME" env var`) and blanks all dashboards.
+ *
+ * Contract:
+ *  - No-op for git checkouts (.git present) — they refresh via `git pull`.
+ *  - No-op when there is no deployed compose yet (bootstrap path handles it).
+ *  - No-op when the deployed compose content already matches the target.
+ *  - No-op (treated exactly like a fetch failure) when the fetched payload does
+ *    not validate as a real compose — keep the existing compose, no backup,
+ *    warn, return false. Prevents an HTML/proxy 200 body from clobbering a
+ *    working compose with junk.
+ *  - Backs up the prior compose before overwriting and NEVER overwrites an
+ *    existing backup, so the first/pristine compose is always preserved across
+ *    repeated runs (the backup name is uniquified by old-content hash).
+ *  - Touches ONLY docker-compose.yml. Never .env / instances.yml / .pgwatch-config.
+ *  - Best-effort: a fetch/validation failure warns and keeps the existing
+ *    compose (the upgrade still proceeds) — we must not turn a metrics-only
+ *    outage into a hard CLI failure.
+ *
+ * @param projectDir monitoring project directory.
+ * @param oldTag the PGAI_TAG of the *deployed* (pre-upgrade) compose, used to
+ *   label the backup. Callers that rewrite .env's PGAI_TAG before this runs
+ *   (e.g. local-install) MUST pass the captured OLD tag here so the backup
+ *   reflects the OLD version. When omitted, it is read from the project's .env.
+ * @returns true if the compose was refreshed, false otherwise.
+ */
+async function refreshBundledComposeIfStale(projectDir: string, oldTag?: string | null): Promise<boolean> {
+  // Git checkouts manage docker-compose.yml via the repo itself (`git pull`).
+  if (fs.existsSync(path.resolve(projectDir, ".git"))) return false;
+
+  const composeFile = path.resolve(projectDir, "docker-compose.yml");
+  // Nothing deployed yet -> the green-field bootstrap path handles fetching it.
+  if (!fs.existsSync(composeFile)) return false;
+
+  const refs = [
+    process.env.PGAI_PROJECT_REF,
+    pkg.version,
+    `v${pkg.version}`,
+  ].filter((v): v is string => Boolean(v && v.trim()));
+
+  const fetched = await fetchTargetCompose(refs);
+  // Validate BEFORE doing anything destructive: an empty body, a fetch failure,
+  // or a non-compose 200 (HTML login/proxy/maintenance page) are all treated
+  // identically — keep the existing compose, write no backup, warn, no-op.
+  if (!fetched || !isValidComposeYaml(fetched)) {
+    console.error(`⚠ Could not refresh docker-compose.yml to ${pkg.version} (no valid compose was retrieved).`);
+    console.error(" Keeping the existing compose. If dashboards are blank after upgrade, re-run this command once network is available.");
+    return false;
+  }
+
+  // Compare on-disk CONTENT against the freshly fetched target (whitespace-only
+  // diffs ignored). This is correct regardless of when PGAI_TAG was rewritten in
+  // .env (local-install rewrites it before this runs), so we never rely on a
+  // possibly-stale tag heuristic.
+  const existing = fs.readFileSync(composeFile, "utf8");
+  if (composeContentEqual(existing, fetched)) return false; // already current
+
+  // Label the backup with the OLD (deployed) tag. Callers that already rewrote
+  // .env pass it in (raw); otherwise read it from .env. Sanitize centrally so the
+  // caller-supplied oldTag (e.g. local-install's previousTag) cannot bypass the
+  // filename validation — a hostile/malformed tag falls back to the timestamp.
+  const deployedTag = sanitizeTagForBackup(oldTag ?? readDeployedTag(projectDir));
+  const tagPart = deployedTag ?? new Date().toISOString().replace(/[:.]/g, "-");
+  // Uniquify with a short hash of the OLD content so repeated runs (e.g.
+  // update-config, where PGAI_TAG never advances) cannot overwrite the first,
+  // pristine backup. Always preserve the original compose.
+  const oldHash = crypto.createHash("sha256").update(existing).digest("hex").slice(0, 8);
+  const backup = path.resolve(projectDir, `docker-compose.yml.bak-${tagPart}-${oldHash}`);
+  let backupName: string | null = null;
+  try {
+    // "wx" => fail if the backup already exists, so we never clobber a prior
+    // backup. If this exact old content was already backed up, that's fine —
+    // the pristine copy is already on disk under the same name.
+    fs.writeFileSync(backup, existing, { encoding: "utf8", mode: 0o600, flag: "wx" });
+    backupName = path.basename(backup);
+  } catch (err) {
+    const e = err as NodeJS.ErrnoException;
+    if (e && e.code === "EEXIST") {
+      // Identical old content already backed up under this name — keep it.
+      backupName = path.basename(backup);
+    }
+    // Any other error: non-fatal, proceed with the refresh even without a backup.
+  }
+  fs.writeFileSync(composeFile, fetched, { encoding: "utf8", mode: 0o600 });
+  const backupNote = backupName ? ` (backup: ${backupName})` : "";
+  console.log(`✓ Refreshed docker-compose.yml to ${pkg.version}${backupNote}`);
+  return true;
+}
+
 /**
  * Get configuration from various sources
  * @param opts - Command line options
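
Reviewer sketch: the "never overwrite an existing backup" guarantee in `refreshBundledComposeIfStale` rests on the `"wx"` flag of `fs.writeFileSync`, which fails with `EEXIST` when the target already exists. The standalone example below demonstrates that behavior in a temporary directory. `writeOnce` is a hypothetical helper, and unlike the CLI code it rethrows errors other than `EEXIST`.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper (illustration only): create `file` with `content` unless
// it already exists. Returns true if created, false if it was already there.
function writeOnce(file: string, content: string): boolean {
  try {
    // "wx" opens for writing and fails with EEXIST if the path exists.
    fs.writeFileSync(file, content, { encoding: "utf8", mode: 0o600, flag: "wx" });
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err; // this sketch surfaces other errors; the CLI treats them as non-fatal
  }
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "compose-backup-"));
const backupPath = path.join(dir, "docker-compose.yml.bak-example");

const createdFirst = writeOnce(backupPath, "pristine compose\n");
const createdSecond = writeOnce(backupPath, "later compose\n");
const kept = fs.readFileSync(backupPath, "utf8");
fs.rmSync(dir, { recursive: true, force: true });

console.log(createdFirst, createdSecond, JSON.stringify(kept));
```

The second call is rejected, so the first (pristine) content is what remains on disk.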
@@ -2427,10 +2629,15 @@ mon
     let existingReplicatorPassword: string | null = null;
     let existingVmAuthUsername: string | null = null;
     let existingVmAuthPassword: string | null = null;
+    // Capture the OLD (deployed) tag BEFORE we rewrite PGAI_TAG below, so the
+    // compose-refresh backup is labeled with the version being upgraded FROM.
+    let previousTag: string | null = null;

     if (fs.existsSync(envFile)) {
       const existingEnv = fs.readFileSync(envFile, "utf8");
       // Extract existing values (except tag - always use CLI version)
+      const previousTagMatch = existingEnv.match(/^PGAI_TAG=(.+)$/m);
+      if (previousTagMatch) previousTag = previousTagMatch[1].trim();
       const registryMatch = existingEnv.match(/^PGAI_REGISTRY=(.+)$/m);
       if (registryMatch) existingRegistry = registryMatch[1].trim();
       const pwdMatch = existingEnv.match(/^GF_SECURITY_ADMIN_PASSWORD=(.+)$/m);
@@ -2463,6 +2670,14 @@ mon
     envLines.push(`VM_AUTH_PASSWORD=${existingVmAuthPassword || crypto.randomBytes(18).toString("base64")}`);
     fs.writeFileSync(envFile, envLines.join("\n") + "\n", { encoding: "utf8", mode: 0o600 });

+    // Non-git upgrade safety: bring the CLI-owned compose up to the target version
+    // so newly-required service wiring (e.g. VM_AUTH_* on sink-prometheus) is present.
+    // Compares deployed compose CONTENT against the target, so it's correct even
+    // though PGAI_TAG was just rewritten above. No-op for git checkouts / when current.
+    // Pass the OLD tag (captured before the .env rewrite) so the backup is labeled
+    // with the version we're upgrading FROM, not the new one.
+    await refreshBundledComposeIfStale(projectDir, previousTag);
+
     if (opts.tag) {
       console.log(`Using image tag: ${imageTag}\n`);
     }
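
Reviewer sketch: `previousTag` is passed to the helper raw, so the central sanitizer is what keeps it safe to use in a filename. The example below shows how the charset check from `sanitizeTagForBackup` treats a few inputs. `stripQuotesSketch` is an assumed stand-in for the CLI's `stripMatchingQuotes`, whose body is not shown in this diff; its real behavior may differ.

```typescript
// Assumed stand-in for stripMatchingQuotes (not shown in the diff): removes one
// pair of matching single or double quotes around the value, if present.
function stripQuotesSketch(value: string): string {
  const m = value.match(/^(["'])(.*)\1$/);
  return m ? m[2] : value;
}

// Mirrors the regex used by sanitizeTagForBackup in the diff.
function sanitizeTagSketch(tag: string | null | undefined): string | null {
  if (tag == null) return null;
  const stripped = stripQuotesSketch(tag);
  return /^[A-Za-z0-9._-]{1,64}$/.test(stripped) ? stripped : null;
}

const samples = ["0.14.0", '"0.15.0-rc.1"', "../../../tmp/x", "a/b", "", "x".repeat(65)];
const results = samples.map((s) => sanitizeTagSketch(s));

// Plain and quoted tags pass; anything with a path separator, an empty value,
// or more than 64 characters is rejected to null.
console.log(results);
```

A rejected tag leads to the timestamp suffix in the backup name instead of the tag.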
@@ -3069,6 +3284,12 @@ mon
       console.log("(existing values were preserved; missing keys filled with safe defaults)\n");
     }

+    // Non-git installs: refresh the CLI-owned compose so it matches the target
+    // stack version. Otherwise newly-required service wiring (e.g. VM_AUTH_* on
+    // sink-prometheus, added in 0.15) is missing and VictoriaMetrics crashes.
+    // No-op for git checkouts and when the compose already matches.
+    await refreshBundledComposeIfStale(projectDir);
+
     const code = await runCompose(["run", "--rm", "sources-generator"]);
     if (code !== 0) process.exitCode = code;
   });
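
Reviewer sketch: the "no-op when the compose already matches" case depends on `composeContentEqual`, which strips whitespace only at the very end of each document. The example below reuses the same regex to show which differences are ignored and which still count as a change. `contentEqualSketch` is an illustrative name.

```typescript
// Same comparison as composeContentEqual in the diff: whitespace at the end of
// the whole document is ignored; everything else must match exactly.
function contentEqualSketch(a: string, b: string): boolean {
  return a.replace(/\s+$/, "") === b.replace(/\s+$/, "");
}

const base = "services:\n  sink-prometheus: {}";

const trailingNewlines = contentEqualSketch(base, base + "\n\n"); // end-of-file delta only
const crlfTail = contentEqualSketch(base + "\r\n", base); // \s also covers \r
const innerSpacing = contentEqualSketch(base, "services:\n  sink-prometheus:  {}");
const innerLineTrail = contentEqualSketch("services: \nx: 1", "services:\nx: 1");

console.log(trailingNewlines, crlfTail, innerSpacing, innerLineTrail);
```

Note that trailing spaces on an interior line are not at the end of the document, so they still register as a difference and would trigger a refresh plus backup.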
@@ -3120,7 +3341,14 @@ mon
         const { stdout: pullOut } = await execFilePromise("git", ["pull", "origin", currentBranch]);
         console.log(pullOut);
       } else {
-        console.log("(not a git checkout — skipping git fetch/pull and going straight to image pull)");
+        // npx / global-npm installs are non-git: `git pull` can't refresh the
+        // compose, so bring the CLI-owned docker-compose.yml up to the target
+        // version in place (with a backup). This is what wires VM_AUTH_* into
+        // sink-prometheus for users upgrading from a pre-0.15 (no-VM-auth) stack.
+        // The helper logs only when it actually refreshes/warns, so don't
+        // pre-announce a refresh that may turn out to be a no-op.
+        console.log("(not a git checkout — checking bundled docker-compose.yml)");
+        await refreshBundledComposeIfStale(projectDir);
       }

       // Step 3: pull new images.
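
Reviewer sketch: for non-git installs, the helper tries a short list of candidate refs in order (an optional `PGAI_PROJECT_REF` override, then the CLI version, then the `v`-prefixed version). The example below rebuilds that list and the raw URLs without doing any network I/O. `candidateComposeUrls` is a hypothetical function, and the version and branch values are example inputs.

```typescript
// Hypothetical helper (illustration only): list the raw compose URLs that would
// be tried, in order, for a given CLI version and optional ref override.
function candidateComposeUrls(version: string, overrideRef?: string): string[] {
  const refs = [overrideRef, version, `v${version}`].filter(
    (v): v is string => Boolean(v && v.trim())
  );
  return refs.map(
    (ref) =>
      `https://gitlab.com/postgres-ai/postgres_ai/-/raw/${encodeURIComponent(ref)}/docker-compose.yml`
  );
}

const defaults = candidateComposeUrls("0.15.0");
const withOverride = candidateComposeUrls("0.15.0", "fix/npx-upgrade-stale-compose");
const blankOverride = candidateComposeUrls("0.15.0", "   ");

console.log(defaults);
console.log(withOverride[0]);
console.log(blankOverride.length);
```

`encodeURIComponent` turns the `/` in a branch name into `%2F`, so a ref containing a slash stays a single path segment in the URL. A blank override is filtered out rather than tried.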
@@ -5068,6 +5296,16 @@ mcp
     }
   });

-program.parseAsync(process.argv).finally(() => {
-  closeReadline();
-});
+// Only parse argv when run as the CLI entrypoint (npx / `bun postgres-ai.ts`).
+// When the module is imported (e.g. unit tests exercising the exported helpers),
+// skip the auto-parse so importing doesn't kick off the whole command tree.
+// `import.meta.main` is honored both under bun and in the node-targeted build.
+if (import.meta.main) {
+  program.parseAsync(process.argv).finally(() => {
+    closeReadline();
+  });
+}
+
+// Exported for unit tests (the CLI surface above is unaffected; these are the
+// same functions used by the `mon` commands).
+export { refreshBundledComposeIfStale, readDeployedTag, isValidComposeYaml };
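
Reviewer sketch: of the exported helpers, `isValidComposeYaml` is the one that guards against a 200 response carrying an HTML page. Its first step is a cheap prefix check before any YAML parsing. The example below isolates only that step, reusing the regex from the diff. `looksLikeMarkup` is an illustrative name, and the YAML parse and `sink-prometheus` check are deliberately left out.

```typescript
// Cheap pre-check only: true when the trimmed payload starts like an HTML or
// XML document. The real validator goes on to parse YAML and check services.
function looksLikeMarkup(text: string): boolean {
  return /^<(?:!doctype|html|\?xml)\b/i.test(text.trim());
}

const checks = {
  doctype: looksLikeMarkup("<!DOCTYPE html><html><body>Sign in</body></html>"),
  htmlTag: looksLikeMarkup('\n  <html lang="en">'),
  xml: looksLikeMarkup('<?xml version="1.0"?>'),
  compose: looksLikeMarkup("services:\n  sink-prometheus:\n    image: example\n"),
  empty: looksLikeMarkup(""),
};

console.log(checks);
```

An empty body is not flagged by this regex, which is why the validator in the diff rejects empty input separately before reaching it.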
